In our Capstone Project, we aim to explore the relationship between meteorological conditions (air temperature and rainfall) in Singapore and clinical conditions commonly seen in the local community: dengue fever (DF), hand, foot and mouth disease (HFMD), upper respiratory tract infection (URTI), and diarrhoea.
Specifically, we aim to answer the following questions: Are higher air temperature and rainfall associated with a rise in these clinical conditions in the community? And which region of Singapore has the highest number of DF cases?
# Rule for activating packages in development mode:
# - Specify everything fully
# - Exceptions (these must be activated):
# - ggfortify - to overload `ggplot2::autoplot()`
# - ggmap - has to be activated for `ggmap::register_google()` to work
# - ggplot2 - too troublesome otherwise ("ggplot2::" in function arguments)
# - magrittr - `%>%`
#
# For deployment mode, activate every package that is referenced more than once
# - Add package name here, then use find-and-replace to remove "<package>::"
# - Remember which 8 packages make up the core tidyverse?
#
# On second thought, let's just stick to development mode. We'll be paying a
# tiny penalty every time "<package>::" is used, and the source code would be
# slightly more verbose, but this might be more educational.
pacman::p_load(
ggfortify,
ggmap,
ggplot2,
magrittr
)
Finding the data was the most time-consuming part of this project.
All data are available online and were first gathered into distinct raw datasets:
Click here for a summary page. The full glorious details of their provenance are given below.
The following functions must be defined no matter which method of data import is chosen.
import_moh_weekly <- function(url_or_path = NULL) {
#' Weekly Infectious Diseases Bulletin
#'
#' @description
#' \href{https://www.moh.gov.sg/resources-statistics/infectious-disease-
#' statistics/2020/weekly-infectious-diseases-bulletin}{MOH Weekly Infectious
#' Disease Bulletin} from the Ministry of Health (MOH).
#'
#' \href{https://www.moh.gov.sg/docs/librariesprovider5/diseases-updates/
#' weekly-infectious-disease-bulletin-year-2020
#' d1092fcb484447bc96ef1722b16b0c08.xlsx}{Latest data as of 31 July 2020
#' (2012-W01 to 2020-W30)}.
#'
#' @param url_or_path The URL or file path of the .xlsx file.
#' @return Weekly infectious diseases bulletin (2012-W01 to 2020-W30).
# If no URL or path is specified, try to get the file directly from MOH.
if (is.null(url_or_path)) {
url_or_path = paste0(
"https://www.moh.gov.sg/docs/librariesprovider5/diseases-updates/",
"weekly-infectious-disease-bulletin-year-2020",
"d1092fcb484447bc96ef1722b16b0c08.xlsx"
)
}
# Check if the given path is a URL by trying to download to a temp file. If
# successful, return the temp file. If not, return the original path.
xlsx_file = tryCatch({
temp = tempfile(fileext = ".xlsx")
download.file(url_or_path, destfile = temp, mode = "wb")
temp
}, error = function(e) {
url_or_path
})
# Columns will be renamed to follow the 2020 naming scheme
colnames_2020 = c(
"Campylobacter enterosis" = "Campylobacter enteritis",
"Campylobacterenterosis" = "Campylobacter enteritis",
"Campylobacteriosis" = "Campylobacter enteritis",
"Chikungunya Fever" = "Chikungunya",
"Dengue Haemorrhagic Fever" = "DHF",
"Dengue Fever" = "Dengue",
"Hand, Foot and Mouth Disease" = "HFMD",
"Hand, Foot Mouth Disease" = "HFMD",
"Nipah virus infection" = "Nipah",
"Viral Hepatitis A" = "Acute Viral Hepatitis A",
"Viral Hepatitis E" = "Acute Viral Hepatitis E",
"Zika Virus Infection" = "Zika",
"Zika virus infection" = "Zika"
)
xlsx_file %>%
readxl::excel_sheets() %>%
lapply(function(sheetname) {
df = readxl::read_xlsx(xlsx_file, sheetname, skip = 1)
# Date formats are different for 2020 (dmy instead of mdy)
if (sheetname == "2020") {
df$Start = lubridate::dmy(df$Start)
df$End = lubridate::dmy(df$End)
}
# Find the columns that need renaming, and rename them
mapper = na.omit(colnames_2020[names(df)])
dplyr::rename_with(df, ~mapper, names(mapper))
}) %>%
dplyr::bind_rows() %>%
dplyr::rename(Epiweek = `Epidemiology Wk`) %>%
dplyr::mutate(Epiyear = lubridate::epiyear(Start)) %>%
dplyr::select(Epiyear, everything()) %>%
dplyr::arrange(Start)
}
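As a quick illustration of the named-lookup rename used above, here is a base R sketch with a toy table (the lookup entries mirror `colnames_2020`, but the table and its columns are made up):

```r
# Old-name -> new-name lookup, as in `colnames_2020` above
lookup <- c("Dengue Fever" = "Dengue",
            "Hand, Foot and Mouth Disease" = "HFMD")

# A toy table: one column needs renaming, one does not
df <- data.frame("Dengue Fever" = 1:2, Other = 3:4, check.names = FALSE)

# Keep only the lookup entries whose old names appear in this table
mapper <- na.omit(lookup[names(df)])

# Replace the matching old names with their new names
names(df)[names(df) %in% names(mapper)] <- mapper
names(df)  # "Dengue" "Other"
```

Indexing the named `lookup` vector by the table's column names yields NA for columns with no entry; `na.omit()` drops those, leaving exactly the renames that apply.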
read_kmls <- function(url_or_path) {
#' Read Single or Multiple KML Files
#'
#' @description
#' There are (at least) 2 approaches to handling .kml data:
#' \enumerate{
#' \item sp - rgdal::readOGR()
#' \item sf - sf::st_read()
#' }
#'
#' The sp approach was abandoned as their objects were rather complex and
#' did not facilitate method chaining.
#'
#' The sf approach produced objects that look like data.frames, which had
#' better support for method chaining, but also some peculiarities:
#' \itemize{
#' \item Dimension: set to XY using sf::st_zm()
#' \item Geographic CRS: use sf::`st_crs<-`("WGS84") World Geodetic
#' Survey 1984
#' }
#'
#' @param url_or_path The URL(s) or file path(s) of the .kml file(s).
#' @return A single combined sf object.
# Check if the given paths are URLs by trying to download to temp files. If
# successful, return the temp files. If not, return the original paths.
# Automatically extract .zip files, if any.
kml_files = tryCatch({
temp = tempfile(fileext = paste0(".", tools::file_ext(url_or_path)))
Map(function(x, y) download.file(x, y, mode = "wb"), url_or_path, temp)
sapply(temp, function(x) {
if (endsWith(x, ".zip")) {
unzip(x)
} else {
x
}
})
}, error = function(e) {
url_or_path
})
kml_files %>%
lapply(sf::st_read, quiet = T) %>%
dplyr::bind_rows() %>%
tibble::as_tibble() %>%
sf::st_as_sf() %>%
sf::st_zm()
}
The following code chunk has been disabled, but is provided here to show the details behind the web-scraping process.
import_mss_daily <- function(years, stations = NULL) {
#' Historical Daily Weather Records
#'
#' @description
#' \href{http://www.weather.gov.sg/climate-historical-daily/}{Daily weather
#' records} from the Meteorological Service Singapore (MSS).
#' Available data ranges from January 1980 to June 2020. Only 19 of the 63
#' climate stations are recognized by this function, because these contain
#' more than just rainfall data. \href{http://www.weather.gov.sg/wp-content/
#' uploads/2016/12/Station_Records.pdf}{List of stations, weather parameters
#' and periods of records}.
#'
#' MSS is nice enough to have their data in .csv files, with a systematic
#' naming scheme. Data compilation becomes as simple as generating the
#' appropriate list of URLs.
#'
#' @param years A vector of the years of interest.
#' @param stations A vector of the climate station names.
#' @return Combined daily weather records.
#' @examples
#' import_mss_daily(2012:2020, "Changi")
#' import_mss_daily(2012:2020, c("Changi", "Clementi", "Khatib", "Newton"))
stations_lookup = c(
"Admiralty" = "104_",
"Ang Mo Kio" = "109_",
"Changi" = "24_",
"Choa Chu Kang (South)" = "121_",
"Clementi" = "50_",
"East Coast Parkway" = "107_",
"Jurong (West)" = "44_",
"Jurong Island" = "117_",
"Khatib" = "122_",
"Marina Barrage" = "108_",
"Newton" = "111_",
"Pasir Panjang" = "116_",
"Pulau Ubin" = "106_",
"Seletar" = "25_",
"Sembawang" = "80_",
"Sentosa Island" = "60_",
"Tai Seng" = "43_",
"Tengah" = "23_",
"Tuas South" = "115_"
)
# Check that all provided station names are in the list; if not, stop and
# print the list of valid names for the user.
mask = !(stations %in% names(stations_lookup))
if (any(mask)) {
stop("The following station names are not recognized:\n",
paste(stations[mask], collapse = "\n"),
"\n\nPlease select from the following:\n",
paste(names(stations_lookup), collapse = "\n"))
}
# If no station names specified, take the full list
if (is.null(stations)) {
station_nums = stations_lookup
} else {
station_nums = stations_lookup[stations]
}
# The URL to each .csv file has the following format:
# - "<url_base><station number>_<YYYY><MM>.csv"
# - Notice the station numbers given above contain the "_" suffix
url_base = "http://www.weather.gov.sg/files/dailydata/DAILYDATA_S"
# To enumerate all the URLs, take the Cartesian product of:
# - station numbers
# - years
# - months (we'll always attempt to find all 12 months)
#
# Flatten the result into a vector, then prefix and suffix with `url_base`
# and ".csv", respectively.
# Base R
urls = station_nums %>%
outer(years, FUN = paste0) %>%
outer(sprintf("%02d", 1:12), FUN = paste0) %>%
sort() %>% # Flatten and arrange in alphabetical order
paste0(url_base, ., ".csv")
# Equivalent tidyverse approach (for reference)
# urls = station_nums %>%
# tidyr::crossing(years, sprintf("%02d", 1:12)) %>%
# tidyr::unite("station_year_month", sep = "") %>%
# dplyr::pull() %>%
# paste0(url_base, ., ".csv")
# We have a list of URLs, but some of them may not exist (some stations do
# not have data for some months). Use tryCatch() to return a table for the
# URLs that exist, and a NA for those that don't.
#
# Problems with multibyte characters and countermeasures:
# - Some colnames contained the "degree sign" symbol which somehow made the
# table contents inaccessible. Tables after April 2020 had slightly
# different colnames.
# - Manually set the colnames for each table.
# - Some tables contained the "em dash" symbol to indicate missing values.
# - They will be coerced to NA when the columns are type cast to numeric.
# - There will be warning messages.
dfs = urls %>%
lapply(function(url) {
tryCatch({
df = readr::read_csv(url,
skip = 1,
col_names = c(
"Station",
"Year",
"Month",
"Day",
"Rainfall",
"Highest_30_min_rainfall",
"Highest_60_min_rainfall",
"Highest_120_min_rainfall",
"Mean_temp",
"Max_temp",
"Min_temp",
"Mean_wind",
"Max_wind"
),
col_types = readr::cols_only(
"Station" = "c",
"Year" = "n",
"Month" = "n",
"Day" = "n",
"Rainfall" = "n",
"Mean_temp" = "n",
"Max_temp" = "n",
"Min_temp" = "n"
))
# Announce progress (UX is important! We can tolerate lower efficiency)
message(paste(df[1, 1:3], collapse = "_"))
df
}, error = function(e) {
NA
})
})
dfs[!is.na(dfs)] %>%
dplyr::bind_rows() %>%
# Calculate daily temperature range
dplyr::mutate(Temp_range = Max_temp - Min_temp,
.keep = "unused") %>%
# Calculate epidemiological years and weeks
dplyr::mutate(Date = lubridate::ymd(paste(Year, Month, Day, sep = "-")),
Epiyear = lubridate::epiyear(Date),
Epiweek = lubridate::epiweek(Date),
.keep = "unused",
.after = Station) %>%
dplyr::arrange(Station, Date)
}
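The URL enumeration above can be checked in miniature. This sketch uses one hypothetical station and only two months, so it makes no network requests:

```r
url_base <- "http://www.weather.gov.sg/files/dailydata/DAILYDATA_S"
station_nums <- c("Changi" = "24_")  # station number with "_" suffix
years <- 2020

# Cartesian product of station numbers, years, and months, then flatten
stems <- sort(outer(outer(station_nums, years, FUN = paste0),
                    sprintf("%02d", 1:2), FUN = paste0))
urls <- paste0(url_base, stems, ".csv")
urls
```

With 19 stations, 9 years, and 12 months, the same two `outer()` calls enumerate all 2052 candidate URLs; the `tryCatch()` below then silently skips the ones that do not exist.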
import_hcidirectory <- function() {
#' Healthcare Institutions Directory
#'
#' @description
#' The \href{http://hcidirectory.sg/hcidirectory/}{Healthcare Institutions
#' (HCI) Directory}, an initiative by the Ministry of Health (MOH), is a
#' platform for all HCIs licensed under the Private Hospitals and Medical
#' Clinics (PHMC) Act to provide information about their services and
#' operations to the public.
#'
#' This function is custom-made to consolidate the names and addresses of
#' HCIs which are medical clinics that offer general medical services.
#'
#' The HCI Directory is a dynamic web page, so using RSelenium might be
#' required.
#'
#' @return The names and addresses of selected HCIs.
# Run a Selenium Server using `RSelenium::rsDriver()`. The parameters e.g.
# `browser`, `chromever` (or `geckover` if using Firefox, or other drivers
# if using other browsers) have to be properly set. Trial-and-error until a
# configuration works. Set `check = T` the very first time it's run on a
# system, then set `check = F` after that to speed things up.
rD = RSelenium::rsDriver(browser = "chrome",
chromever = "83.0.4103.39",
check = F)
# Connect to server with a remoteDriver instance.
remDr = rD$client
# Set timeout on waiting for elements
remDr$setTimeout(type = "implicit", milliseconds = 10000)
# Navigate to the given URL
remDr$navigate("http://hcidirectory.sg/hcidirectory/")
# Click 4 things:
# 1. "MORE SEARCH OPTIONS"
# 2. "Medical Clinics Only"
# 3. "General Medical"
# 4. "Search"
c(
"options" = "#moreSearchOptions",
"medclins" = "#criteria > table > tbody > tr:nth-child(2) > td > label",
"genmed" = "#isGenMed",
"search" = "#search_btn_left"
) %>%
lapply(remDr$findElement, using = "css") %>%
purrr::walk(function(elem) elem$clickElement())
# Find the number of pages
results = remDr$findElement("#results", using = "css")
npages = results$getElementAttribute("innerHTML")[[1]] %>%
xml2::read_html() %>%
rvest::html_node("#totalPage") %>%
rvest::html_attr("value") %>%
as.numeric()
# Create an empty tibble to append results
df = tibble::tibble(
id = character(),
name = character(),
add = character()
)
i = 1
while (T) {
results = remDr$findElement("#results", using = "css")
html = results$getElementAttribute("innerHTML")[[1]] %>%
xml2::read_html()
# Determine the index numbers of the (up to 10) results on the page
idx = html %>%
# Find the element that says "SHOWING 1 - 10 OF 1,761 RESULTS"
rvest::html_nodes(".col1") %>%
.[1] %>%
rvest::html_text() %>%
# Commas have to be eliminated for numbers > 999
gsub(",", "", .) %>%
# Find the smallest and largest numbers and apply the colon operator
sub(".*Showing\\s+(.*)\\s+of.*", "\\1", .) %>%
strsplit(split = " - ") %>%
unlist() %>%
as.numeric() %>%
{ .[1]:.[2] }
# Only append results if IDs are not in the table
if (!any(idx %in% df$id)) {
df = df %>%
dplyr::bind_rows(
html %>%
# Find both the name and address nodes
rvest::html_nodes(".name,.add") %>%
rvest::html_text() %>%
# Tidy whitespace
gsub("\\s+", " ", .) %>%
trimws() %>%
# Concatenate IDs, odd rows (names), and even rows (addresses)
{ cbind(idx, .[c(TRUE,FALSE)], .[!c(TRUE,FALSE)]) } %>%
tibble::as_tibble() %>%
setNames(c("id", "name", "add"))
)
# Announce progress and increment page counter
message(i, " of ", npages, " done (", round(i / npages * 100, 2), "%)")
i = i + 1
}
# Natural exit point
if (i > npages) break
# Navigate to the next page (if available, else stop)
the_end = tryCatch({
nextpage = remDr$findElement("#PageControl > div.r_arrow", using = "css")
nextpage$clickElement()
F
}, error = function(e) {
print(paste("There are no more pages after", i))
T
})
# Unnatural exit point
if (the_end) break
}
# Clean up RSelenium
remDr$close()
rD[["server"]]$stop()
rm(rD, remDr)
gc()
# Kill Java instance(s) inside RStudio
# docs.microsoft.com/en-us/windows-server/administration/windows-commands/taskkill
system("taskkill /im java.exe /f", intern = F, ignore.stdout = F)
# Clean up:
# - Franchises may have the same name with different addresses
# - Different practices may have the same zipcodes and even buildings
# - We will consider each full address unique, and a single practice
# Clean up duplicate addresses
df %>%
.[!duplicated(tolower(.$add), fromLast = T),]
}
zipcodes_to_geocodes <- function(zipcodes) {
#' Get Geo-location from Google Maps
#'
#' @description
#' Attempt to obtain the longitudes, latitudes, and addresses of the given
#' zipcodes using ggmap::geocode().
#'
#' @param zipcodes A vector of zipcodes.
#' @return Geo-location data of the associated zipcodes.
# Prompt user to input API key
ggmap::register_google(key = readline("Please enter Google API key: "))
# Create an (almost) empty tibble to append results
res = zipcodes %>%
# Remove duplicates to minimize number of requests
.[!duplicated(.)] %>%
tibble::as_tibble() %>%
dplyr::rename(zip = value) %>%
dplyr::mutate(lon = NA_real_,
lat = NA_real_,
address = NA_character_)
for (i in 1:nrow(res)) {
result = tryCatch({
ggmap::geocode(res$zip[i], output = "latlona", source = "google")
}, warning = function(w) {
w$message
}, error = function(e) {
NA
})
# If the registered key is invalid, there's no point continuing
if (grepl("The provided API key is invalid", result[1], fixed = T)) {
stop("A valid Google API key is required.")
}
# A useful result is a data frame (not an NA or a warning message)
if (is.data.frame(result)) {
res$lon[i] = result$lon
res$lat[i] = result$lat
res$address[i] = result$address
}
# Announce progress
message(i, " of ", nrow(res), " (",round(i / nrow(res) * 100, 2), "%)")
}
res
}
The following code chunk has been disabled, but is provided here to show how the data may be obtained from primary sources (with one exception). Using this code chunk would require enabling the previous code chunk, as well as a valid Google Maps Platform API key.
grand_import_de_novo <- function() {
# This might take a while, and requires a Google Maps Platform API key
list(
"moh_bulletin" = import_moh_weekly(),
"mss_19stations" = import_mss_daily(years = 2012:2020),
"hci_clinics" = import_hcidirectory() %>%
dplyr::mutate(zip = sub(".*((?i)singapore\\s+\\d+).*", "\\1", add)) %>%
# Requires Google Maps Platform API key
dplyr::left_join(zipcodes_to_geocodes(.$zip), by = "zip") %>%
dplyr::select(-id, -zip, -address) %>%
tidyr::drop_na(),
"planning_areas" = read_kmls(paste0(
"https://geo.data.gov.sg/plan-bdy-dwelling-type-2017/2017/09/27/kml/",
"plan-bdy-dwelling-type-2017.kml"
)),
"regions" = read_kmls(paste0(
"https://geo.data.gov.sg/mp14-region-web-pl/2014/12/05/kml/",
"mp14-region-web-pl.zip"
)),
"dengue_polys" = read_kmls(
c("central", "northeast", "southeast", "southwest") %>%
paste0("https://geo.data.gov.sg/denguecase-", .,
"-area/2020/07/28/kml/denguecase-", ., "-area.kml")
),
"aedes_polys" = read_kmls(
c("central", "northeast", "northwest", "southeast", "southwest") %>%
paste0("https://geo.data.gov.sg/breedinghabitat-", .,
"-area/2020/07/28/kml/breedinghabitat-", ., "-area.kml")
),
"mss_63station_pos" = readr::read_csv(paste0(
"https://raw.githubusercontent.com/roscoelai/dasr2020capstone/master/",
"data/mss/Station_Records.csv"
))
)
}
# raw_data <- grand_import_de_novo()
If the data files are available locally, set from_online_repo = F for faster loading. Otherwise, the files will be obtained from an online repository. The .csv files will be read directly, while the other file formats will be downloaded to a temporary folder and read locally.
grand_import_no_webscraping <- function(from_online_repo = TRUE) {
# Allow user to choose whether to import raw data from an online repository
# or from local files.
if (from_online_repo) {
fld = paste0("https://raw.githubusercontent.com/roscoelai/",
"dasr2020capstone/master/data/")
} else {
# Check that the "../data/" folder exists
assertthat::assert_that(dir.exists("../data/"),
msg = 'Unable to locate "../data/" directory.')
fld = "../data/"
}
list(
"moh_bulletin" = import_moh_weekly(paste0(
fld, "moh/weekly-infectious-disease-bulletin-year-2020.xlsx"
)),
"mss_19stations" = readr::read_csv(paste0(
fld, "mss/mss_daily_2012_2020_19stations_20200728.csv"
)),
"hci_clinics" = readr::read_csv(paste0(
fld, "hcid/hci_clinics_20200725.csv"
)),
"planning_areas" = read_kmls(paste0(
fld, "kmls/plan-bdy-dwelling-type-2017.kml"
)),
"regions" = read_kmls(paste0(
fld, "kmls/MP14_REGION_WEB_PL.kml"
)),
"dengue_polys" = read_kmls(paste0(
fld, "kmls/denguecase-", c("central",
"northeast",
"southeast",
"southwest"), "-area.kml"
)),
"aedes_polys" = read_kmls(paste0(
fld, "kmls/breedinghabitat-", c("central",
"northeast",
"northwest",
"southeast",
"southwest"), "-area.kml"
)),
"mss_63station_pos" = readr::read_csv(paste0(
fld, "mss/Station_Records.csv"
))
)
}
raw_data <- grand_import_no_webscraping(from_online_repo = T)
raw_data %>%
tibble::tibble(
names = names(.),
data = .,
nrow = sapply(., nrow),
ncol = sapply(., ncol)
)
# A tibble: 8 x 4
names data nrow ncol
<chr> <named list> <int> <int>
1 moh_bulletin <tibble [470 x 46]> 470 46
2 mss_19stations <tibble [58,489 x 7]> 58489 7
3 hci_clinics <tibble [1,739 x 4]> 1739 4
4 planning_areas <tibble [55 x 3]> 55 3
5 regions <tibble [5 x 3]> 5 3
6 dengue_polys <tibble [1,203 x 3]> 1203 3
7 aedes_polys <tibble [315 x 3]> 315 3
8 mss_63station_pos <tibble [63 x 9]> 63 9
raw_data
$moh_bulletin
# A tibble: 470 x 46
Epiyear Epiweek Start End Cholera Paratyphoid
<dbl> <dbl> <dttm> <dttm> <dbl> <dbl>
1 2012 1 2012-01-01 00:00:00 2012-01-07 00:00:00 0 1
2 2012 2 2012-01-08 00:00:00 2012-01-14 00:00:00 0 1
3 2012 3 2012-01-15 00:00:00 2012-01-21 00:00:00 0 1
4 2012 4 2012-01-22 00:00:00 2012-01-28 00:00:00 0 1
5 2012 5 2012-01-29 00:00:00 2012-02-04 00:00:00 0 3
6 2012 6 2012-02-05 00:00:00 2012-02-11 00:00:00 0 2
7 2012 7 2012-02-12 00:00:00 2012-02-18 00:00:00 0 1
8 2012 8 2012-02-19 00:00:00 2012-02-25 00:00:00 0 4
9 2012 9 2012-02-26 00:00:00 2012-03-03 00:00:00 0 4
10 2012 10 2012-03-04 00:00:00 2012-03-10 00:00:00 0 2
# ... with 460 more rows, and 40 more variables: Typhoid <dbl>, `Acute Viral
# Hepatitis A` <dbl>, `Acute Viral Hepatitis E` <dbl>, Poliomyelitis <dbl>,
# Plague <dbl>, `Yellow Fever` <dbl>, Dengue <dbl>, DHF <dbl>, Malaria <dbl>,
# Chikungunya <dbl>, HFMD <dbl>, Diphtheria <dbl>, Measles <dbl>,
# Mumps <dbl>, Rubella <dbl>, SARS <dbl>, Nipah <dbl>, `Acute Viral hepatitis
# B` <dbl>, Encephalitis <dbl>, Legionellosis <dbl>, `Campylobacter
# enteritis` <dbl>, `Acute Viral hepatitis C` <dbl>, Leptospirosis <dbl>,
# Melioidosis <dbl>, `Meningococcal Infection` <dbl>, Pertussis <dbl>,
# `Pneumococcal Disease (invasive)` <dbl>, `Haemophilus influenzae type
# b` <dbl>, `Salmonellosis(non-enteric fevers)` <dbl>, `Avian
# Influenza` <dbl>, Zika <dbl>, `Ebola Virus Disease` <dbl>, `Japanese
# Encephalitis` <dbl>, Tetanus <dbl>, Botulism <dbl>, `Murine Typhus` <dbl>,
# `Acute Upper Respiratory Tract infections` <dbl>, `Acute
# Conjunctivitis` <dbl>, `Acute Diarrhoea` <dbl>, Chickenpox <dbl>
$mss_19stations
# A tibble: 58,489 x 7
Station Date Epiyear Epiweek Rainfall Mean_temp Temp_range
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Admiralty 2012-01-01 2012 1 0 27.3 7.40
2 Admiralty 2012-01-02 2012 1 0 27.1 5.8
3 Admiralty 2012-01-03 2012 1 0 26.9 4.60
4 Admiralty 2012-01-04 2012 1 0 26.8 7
5 Admiralty 2012-01-05 2012 1 0 26.4 6.5
6 Admiralty 2012-01-06 2012 1 0 27.1 6.40
7 Admiralty 2012-01-07 2012 1 1.4 26.5 6.20
8 Admiralty 2012-01-08 2012 2 3.4 24.8 2.40
9 Admiralty 2012-01-09 2012 2 2.6 25.4 3.4
10 Admiralty 2012-01-10 2012 2 8.6 24.7 2.80
# ... with 58,479 more rows
$hci_clinics
# A tibble: 1,739 x 4
name add lon lat
<chr> <chr> <dbl> <dbl>
1 DA CLINIC @ BISHAN BLK 501 BISHAN STREET 11 #01-374 Singapor~ 104. 1.35
2 SINGHEALTH POLYCLINIC~ 1 TAMPINES STREET 41 TAMPINES POLYCLINIC ~ 104. 1.36
3 1 MEDICAL TECK GHEE BLK 410 ANG MO KIO AVENUE 10 TECK GHEE SQ~ 104. 1.36
4 115 EASTPOINT CLINIC ~ BLK 115 BEDOK NORTH RD #01-301 Singapore ~ 104. 1.33
5 18 CLINIC BLK 101 TOWNER ROAD #01-228 Singapore 322~ 104. 1.32
6 1AESTHETICS, MEDICAL ~ 8 EU TONG SEN STREET THE CENTRAL #14-90 S~ 104. 1.29
7 326 AVENUE 3 CLINIC BLK 326 SERANGOON AVE 3 #01-382 Singapore~ 104. 1.35
8 338 FAMILY CLINIC BLK 338 ANG MO KIO AVE 1 #01-1615 Singapo~ 104. 1.36
9 57 MEDICAL CLINIC (GE~ BLK 57 GEYLANG BAHRU #01-3505 Singapore 3~ 104. 1.32
10 8 MEDICAL AESTHETIC C~ 51 CUPPAGE ROAD #01-03 Singapore 229469 104. 1.30
# ... with 1,729 more rows
$planning_areas
Simple feature collection with 55 features and 2 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
geographic CRS: WGS 84
# A tibble: 55 x 3
Name Description geometry
<chr> <chr> <MULTIPOLYGON [°]>
1 kml_1 "<center><table><tr><th colsp~ (((103.8572 1.396537, 103.8574 1.39629~
2 kml_2 "<center><table><tr><th colsp~ (((103.9319 1.343094, 103.9355 1.33956~
3 kml_3 "<center><table><tr><th colsp~ (((103.8492 1.362753, 103.8494 1.36268~
4 kml_4 "<center><table><tr><th colsp~ (((103.6973 1.307544, 103.6973 1.30755~
5 kml_5 "<center><table><tr><th colsp~ (((103.7641 1.370011, 103.7644 1.36947~
6 kml_6 "<center><table><tr><th colsp~ (((103.8172 1.294353, 103.8174 1.29433~
7 kml_7 "<center><table><tr><th colsp~ (((103.7745 1.390289, 103.775 1.386071~
8 kml_8 "<center><table><tr><th colsp~ (((103.7977 1.348128, 103.7981 1.34778~
9 kml_9 "<center><table><tr><th colsp~ (((103.9018 1.309745, 103.9015 1.30954~
10 kml_10 "<center><table><tr><th colsp~ (((103.9081 1.309816, 103.9082 1.30910~
# ... with 45 more rows
$regions
Simple feature collection with 5 features and 2 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
geographic CRS: WGS 84
# A tibble: 5 x 3
Name Description geometry
<chr> <chr> <MULTIPOLYGON [°]>
1 CENTRAL ~ "<html xmlns:fo=\"http://ww~ (((103.8487 1.363027, 103.8482 1.36328~
2 EAST REG~ "<html xmlns:fo=\"http://ww~ (((103.9532 1.382015, 103.9528 1.38216~
3 NORTH RE~ "<html xmlns:fo=\"http://ww~ (((103.7766 1.45145, 103.7766 1.45142,~
4 NORTH-EA~ "<html xmlns:fo=\"http://ww~ (((103.8971 1.415018, 103.8971 1.41503~
5 WEST REG~ "<html xmlns:fo=\"http://ww~ (((103.6973 1.307544, 103.6973 1.30754~
$dengue_polys
Simple feature collection with 1203 features and 2 fields
geometry type: POLYGON
dimension: XY
bbox: xmin: 103.6848 ymin: 1.269624 xmax: 103.967 ymax: 1.414322
geographic CRS: WGS 84
# A tibble: 1,203 x 3
Name Description geometry
<chr> <chr> <POLYGON [°]>
1 kml_1 "<center><table><tr><th colsp~ ((103.843 1.322077, 103.843 1.323886, ~
2 kml_2 "<center><table><tr><th colsp~ ((103.8484 1.327504, 103.8484 1.329312~
3 kml_3 "<center><table><tr><th colsp~ ((103.8681 1.327503, 103.8681 1.329312~
4 kml_4 "<center><table><tr><th colsp~ ((103.843 1.3474, 103.843 1.349208, 10~
5 kml_5 "<center><table><tr><th colsp~ ((103.8448 1.3474, 103.8448 1.349208, ~
6 kml_6 "<center><table><tr><th colsp~ ((103.8591 1.322077, 103.8591 1.323886~
7 kml_7 "<center><table><tr><th colsp~ ((103.8699 1.322077, 103.8699 1.323886~
8 kml_8 "<center><table><tr><th colsp~ ((103.8322 1.320269, 103.8322 1.322077~
9 kml_9 "<center><table><tr><th colsp~ ((103.834 1.320269, 103.834 1.322077, ~
10 kml_10 "<center><table><tr><th colsp~ ((103.8555 1.320269, 103.8555 1.322077~
# ... with 1,193 more rows
$aedes_polys
Simple feature collection with 315 features and 2 fields
geometry type: POLYGON
dimension: XY
bbox: xmin: 103.6848 ymin: 1.27505 xmax: 103.9652 ymax: 1.446879
geographic CRS: WGS 84
# A tibble: 315 x 3
Name Description geometry
<chr> <chr> <POLYGON [°]>
1 kml_1 "<center><table><tr><th colsp~ ((103.8537 1.370913, 103.8537 1.372722~
2 kml_2 "<center><table><tr><th colsp~ ((103.8448 1.367296, 103.8448 1.369104~
3 kml_3 "<center><table><tr><th colsp~ ((103.8358 1.378148, 103.8358 1.379957~
4 kml_4 "<center><table><tr><th colsp~ ((103.8861 1.378147, 103.8861 1.379956~
5 kml_5 "<center><table><tr><th colsp~ ((103.8537 1.365487, 103.8537 1.367296~
6 kml_6 "<center><table><tr><th colsp~ ((103.8466 1.370913, 103.8466 1.372722~
7 kml_7 "<center><table><tr><th colsp~ ((103.8717 1.37453, 103.8717 1.376339,~
8 kml_8 "<center><table><tr><th colsp~ ((103.8843 1.376339, 103.8843 1.378147~
9 kml_9 "<center><table><tr><th colsp~ ((103.8897 1.390808, 103.8897 1.392617~
10 kml_10 "<center><table><tr><th colsp~ ((103.8286 1.314842, 103.8286 1.316651~
# ... with 305 more rows
$mss_63station_pos
# A tibble: 63 x 9
Station `Lat.(N)` `Long. (E)` `Period of Dail~ `Period of 30,6~
<chr> <dbl> <dbl> <chr> <chr>
1 Paya L~ 1.35 104. Jan 1980-current <NA>
2 Macrit~ 1.34 104. Jan 1980-current Jan 2014-current
3 Lower ~ 1.37 104. Jan 2010-current Jan 2014-current
4 Tengah 1.39 104. Jan 1980-current <NA>
5 Changi 1.37 104. Jan 1981-current Jan 2014-current
6 Seletar 1.42 104. Jan 1980-current <NA>
7 Pasir ~ 1.39 104. Jan 1980-current Jan 2014-current
8 Kampon~ 1.27 104. Jan 1980-current Jan 2014-Oct 20~
9 Jurong~ 1.31 104. Jan 1980-current Jan 2014-current
10 Ulu Pa~ 1.33 104. Jan 1980-current Jan 2014-current
# ... with 53 more rows, and 4 more variables: `Period of Mean
# Temperature` <chr>, `Period of Max and Min Temperature` <chr>, `Period of
# Mean Wind Speed` <chr>, `Period of Max Wind Speed` <chr>
transpose_html_table <- function(html) {
xml2::read_html(html) %>%
rvest::html_node("table") %>%
rvest::html_table() %>%
t() %>%
`colnames<-`(.[1,]) %>%
.[2,]
}
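transpose_html_table() turns a two-column key-value table into a single named row. The same transpose-and-relabel steps can be shown with a base R stand-in for the parsed table (the contents here are made up):

```r
# A key-value table, as rvest::html_table() might return it
df <- data.frame(field = c("Name", "Cases"),
                 value = c("Bedok", "12"))

# Transpose, promote the first row to column names, keep the second row
m <- t(df)
colnames(m) <- m[1, ]
m[2, ]
```

After transposing, row 1 holds the keys and row 2 the values, so the result is a named character vector with one element per key.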
idw_interpolation <- function(points, points_label,
polys, polys_label,
ordinal = 2) {
#' Inverse-distance-weighted (IDW) Interpolation
#'
#' @description
#' Calculate the inverse-distance-weighted (IDW) averages of the values of a
#' given set of points to the centroids of a given set of polygons.
#'
#' The ordinal determines the amount of weight given to inverse-distances.
#' An ordinal of 0 will return the same value (the arithmetic mean) to all
#' centroids. A large ordinal will heavily penalize points that are far away.
#' A negative ordinal will result in distance-weighted interpolation - please
#' don't do that.
#'
#' @param points sf object containing point features.
#' @param polys sf object containing polygon features.
#' @param points_label Column name of the labels in points.
#' @param polys_label Column name of the labels in polys.
#' @param ordinal Determines the amount of weight given to inverse-distances.
#' @return Interpolated values for each polygon.
if (ordinal < 0) {
stop("You are highly encouraged to use a non-negative ordinal.")
}
# weights: (nrow(polys) x nrow(points))
weights = polys %>%
sf::st_centroid() %>%
sf::st_distance(points) %>%
units::set_units(km) %>%
{ 1 / (.^ordinal) }
# values: (nrow(points) x ncol(values))
values = points %>%
as.data.frame() %>%
dplyr::select(-{{ points_label }}, -geometry) %>%
as.matrix()
# result: (nrow(polys) x ncol(values))
result = tibble::as_tibble(weights %*% values / rowSums(weights))
polys %>%
tibble::as_tibble() %>%
dplyr::select({{ polys_label }}) %>%
dplyr::bind_cols(result)
}
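As a plain-numeric check of the weighting scheme (no sf objects; the distances and values below are made up): with ordinal p, each point gets weight 1/d^p, and the interpolated value is the weighted mean.

```r
d <- c(1, 2, 4)     # distances (km) from one centroid to three points
v <- c(30, 26, 22)  # values observed at those points

idw <- function(d, v, p) {
  w <- 1 / d^p          # inverse-distance weights
  sum(w * v) / sum(w)   # weighted mean
}

idw(d, v, 2)  # ordinal 2: the closest point dominates
idw(d, v, 0)  # ordinal 0: all weights are 1, i.e. the arithmetic mean
```

This also confirms the docstring's claim that an ordinal of 0 reduces to the arithmetic mean of the point values.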
find_superset <- function(sub, super, super_label, sub_label = ".") {
sub %>%
sf::st_centroid() %>%
sf::st_intersects(super) %>%
tibble::as_tibble() %>%
dplyr::transmute(
{{ sub_label }} := sub[[sub_label]][row.id],
{{ super_label }} := super[[super_label]][col.id]
)
}
# Store intermediate processed data in this list
data <- list()
We work towards 2 datasets:

- data_time, containing epidemiological and meteorological data from 2012 to 2020
- data_space, containing the number of dengue cases, Aedes mosquito habitats, clinics, population dwelling types, and the most recent meteorological data, broken down by the planning areas in Singapore

data_time

The unit of analysis for data_time is the epidemiological week.
The epidemiological data (MOH bulletin) is already organized this way.
The meteorological data is organized by day for each of the 19 climate stations. To match the epidemiological data, it has to be aggregated into representative national-level values by epidemiological week.
With this aggregation, how many values go into each variable for each week?
raw_data$mss_19stations %>%
tidyr::pivot_longer(Rainfall:Temp_range) %>%
tidyr::drop_na() %>%
dplyr::count(Epiyear, Epiweek, name) %>%
dplyr::select(name, n) %T>%
{ print(table(.)) } %>%
ggplot(aes(x = n, color = name)) +
geom_histogram(binwidth = 1) +
facet_grid(name ~ ., scales = "free") +
labs(x = "Number of values", y = "Number of weeks") +
theme(legend.position = "none")
n
name 46 51 56 71 73 81 86 89 90 92 94 95 96 97 98 99
Mean_temp 1 0 0 1 0 2 1 1 3 2 2 3 2 2 4 5
Rainfall 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
Temp_range 0 1 0 0 1 1 0 0 1 0 0 2 1 0 1 4
n
name 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
Mean_temp 4 4 3 7 7 5 12 10 14 26 16 16 18 5 4 1
Rainfall 0 0 1 0 0 1 1 0 1 0 1 1 1 1 1 0
Temp_range 0 2 0 2 1 0 2 1 2 8 4 5 5 4 8 4
n
name 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131
Mean_temp 2 8 4 6 1 1 5 6 13 9 66 2 10 11 16 23
Rainfall 2 4 7 7 6 2 3 4 11 9 70 8 15 11 23 19
Temp_range 4 14 11 20 16 19 19 15 25 7 51 3 2 4 5 7
n
name 132 133
Mean_temp 25 55
Rainfall 38 195
Temp_range 10 152
Most weeks aggregate more than 100 values for each of the 3 variables.
Now we join the datasets to get data_time.
data[["moh_bulletin_s"]] <- raw_data$moh_bulletin %>%
dplyr::select(
Epiyear:End,
Dengue,
HFMD,
`Acute Upper Respiratory Tract infections`,
`Acute Diarrhoea`
) %>%
dplyr::rename(
URTI = `Acute Upper Respiratory Tract infections`,
Diarrhoea = `Acute Diarrhoea`
)
A brief detour to clean the MOH bulletin data:
# Find the row(s) with error(s)
mask <- data$moh_bulletin_s$End %>%
{ . < "2012-01-01" | . > "2021-01-03" }
# Personally inspect the erroneous row(s)
data$moh_bulletin_s[mask,]
# A tibble: 1 x 8
Epiyear Epiweek Start End Dengue HFMD URTI
<dbl> <dbl> <dttm> <dttm> <dbl> <dbl> <dbl>
1 2018 1 2017-12-31 00:00:00 1900-01-01 00:00:00 83 326 3158.
# ... with 1 more variable: Diarrhoea <dbl>
# Correct error(s)
data$moh_bulletin_s[mask, "End"] <-
data$moh_bulletin_s$Start[mask] + lubridate::days(6)
# Check correction(s)
data$moh_bulletin_s[mask,]
# A tibble: 1 x 8
Epiyear Epiweek Start End Dengue HFMD URTI
<dbl> <dbl> <dttm> <dttm> <dbl> <dbl> <dbl>
1 2018 1 2017-12-31 00:00:00 2018-01-06 00:00:00 83 326 3158.
# ... with 1 more variable: Diarrhoea <dbl>
data[["mss_19stations_s"]] <- raw_data$mss_19stations %>%
dplyr::group_by(Epiyear, Epiweek) %>%
dplyr::summarise(
Mean_rainfall = mean(Rainfall, na.rm = T),
Med_rainfall = median(Rainfall, na.rm = T),
Mean_temp = mean(Mean_temp, na.rm = T),
Med_temp = median(Mean_temp, na.rm = T),
Mean_temp_rng = mean(Temp_range, na.rm = T),
Med_temp_rng = median(Temp_range, na.rm = T),
.groups = "drop"
)
data_time <- dplyr::left_join(
data$moh_bulletin_s,
data$mss_19stations_s,
by = c("Epiyear", "Epiweek")
) %>%
tidyr::drop_na()
data_time
# A tibble: 444 x 14
Epiyear Epiweek Start End Dengue HFMD URTI
<dbl> <dbl> <dttm> <dttm> <dbl> <dbl> <dbl>
1 2012 1 2012-01-01 00:00:00 2012-01-07 00:00:00 74 326 2932.
2 2012 2 2012-01-08 00:00:00 2012-01-14 00:00:00 64 346 3189.
3 2012 3 2012-01-15 00:00:00 2012-01-21 00:00:00 60 463 3185.
4 2012 4 2012-01-22 00:00:00 2012-01-28 00:00:00 50 372 4001.
5 2012 5 2012-01-29 00:00:00 2012-02-04 00:00:00 84 444 3356.
6 2012 6 2012-02-05 00:00:00 2012-02-11 00:00:00 87 683 3358.
7 2012 7 2012-02-12 00:00:00 2012-02-18 00:00:00 65 810 3033.
8 2012 8 2012-02-19 00:00:00 2012-02-25 00:00:00 50 989 2953.
9 2012 9 2012-02-26 00:00:00 2012-03-03 00:00:00 55 1115 2855.
10 2012 10 2012-03-04 00:00:00 2012-03-10 00:00:00 45 1139 2868.
# ... with 434 more rows, and 7 more variables: Diarrhoea <dbl>,
# Mean_rainfall <dbl>, Med_rainfall <dbl>, Mean_temp <dbl>, Med_temp <dbl>,
# Mean_temp_rng <dbl>, Med_temp_rng <dbl>
dplyr::glimpse(data_time)
Rows: 444
Columns: 14
$ Epiyear <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012,...
$ Epiweek <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
$ Start <dttm> 2012-01-01, 2012-01-08, 2012-01-15, 2012-01-22, 2012...
$ End <dttm> 2012-01-07, 2012-01-14, 2012-01-21, 2012-01-28, 2012...
$ Dengue <dbl> 74, 64, 60, 50, 84, 87, 65, 50, 55, 45, 64, 72, 48, 5...
$ HFMD <dbl> 326, 346, 463, 372, 444, 683, 810, 989, 1115, 1139, 1...
$ URTI <dbl> 2932.222, 3188.727, 3184.545, 4000.571, 3355.818, 335...
$ Diarrhoea <dbl> 490.8889, 575.0909, 538.9091, 614.5714, 558.5455, 556...
$ Mean_rainfall <dbl> 4.939850e-01, 4.386466e+00, 9.695489e+00, 4.429323e+0...
$ Med_rainfall <dbl> 0.0, 1.7, 0.2, 0.0, 0.0, 0.0, 1.2, 0.0, 1.4, 1.4, 0.3...
$ Mean_temp <dbl> 27.35469, 26.09847, 27.29297, 26.66406, 26.49699, 27....
$ Med_temp <dbl> 27.35469, 26.09847, 27.29297, 26.66406, 26.49699, 27....
$ Mean_temp_rng <dbl> 6.498496, 5.030827, 7.760150, 6.642105, 6.358647, 7.5...
$ Med_temp_rng <dbl> 6.4, 4.5, 7.7, 6.8, 6.4, 7.5, 6.5, 6.7, 6.8, 6.7, 6.5...
data_time contains 10 features: 4 indicators of disease numbers and 6 aggregated meteorological measures. Each value is associated with an epidemiological week. The time period spans from 2012-W01 to 2020-W30. As the meteorological records for July 2020 have not been published, there are no corresponding values from 2020-W28 onward.
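As a side note on how these epidemiological weeks line up with calendar dates: lubridate (already used elsewhere in this project) provides `epiweek()` and `epiyear()`, which number weeks starting on Sunday. A minimal standalone sketch, not part of the pipeline above:

```r
# Epidemiological weeks start on Sunday; lubridate::epiweek()/epiyear()
# implement this convention directly.
d <- lubridate::ymd("2012-01-01")   # a Sunday -- the Start of 2012-W01 above
lubridate::epiweek(d)               # 1
lubridate::epiyear(d)               # 2012
lubridate::epiweek(d + lubridate::days(7))  # 2 -- one week later
```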
data_space
A lot more wrangling is involved.
The unit of analysis for data_space is the planning area. This is specified by raw_data$planning_areas, and the other datasets have to be transformed to match it before merging.
Spatial analysis will be rather limited, with more emphasis placed on visualization.
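Much of the matching below relies on the project's find_superset() helper (defined earlier), which maps each feature to the polygon containing it. Conceptually this is a point-in-polygon spatial join; here is a minimal standalone sketch with hypothetical toy data, not the actual helper:

```r
library(sf)

# One unit-square polygon and two points: one inside, one outside
poly <- st_sf(
  plan_area = "A",
  geometry = st_sfc(st_polygon(list(rbind(
    c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)
  ))))
)
pts <- st_sf(
  id = 1:2,
  geometry = st_sfc(st_point(c(0.5, 0.5)), st_point(c(2, 2)))
)

# st_join() with st_within attaches the containing polygon's attributes;
# the point falling outside every polygon gets NA for plan_area
st_join(pts, poly, join = st_within)
```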
# Polygons are 200 m X 200 m squares. The `Description` contains the number
# of occurrences within each polygon. Use a suitable regular expression
# pattern to extract the number for each polygon. Find the centroid for
# each polygon, then duplicate each row by the number of occurrences.
data[c("dengue_points", "aedes_points")] <-
raw_data[c("dengue_polys", "aedes_polys")] %>%
lapply(function(sf_tbl_df) {
sf_tbl_df %>%
dplyr::transmute(n = sub(".*: (\\d+).*", "\\1", Description)) %>%
sf::st_centroid() %>%
.[rep(1:nrow(.), as.numeric(.$n)),]
})
data[["clinic_points"]] <- raw_data$hci_clinics %>%
sf::st_as_sf(coords = c("lon", "lat")) %>%
sf::`st_crs<-`("WGS84")
data[["weather_points"]] <- raw_data$mss_19stations %>%
dplyr::filter(Epiyear == 2020) %>%
dplyr::filter(Epiweek > max(Epiweek) - 4) %>%
dplyr::group_by(Station) %>%
dplyr::summarise(mean_rainfall = mean(Rainfall, na.rm = T),
med_rainfall = median(Rainfall, na.rm = T),
mean_temp = mean(Mean_temp, na.rm = T),
med_temp = median(Mean_temp, na.rm = T),
mean_temp_rng = mean(Temp_range, na.rm = T),
med_temp_rng = median(Temp_range, na.rm = T),
.groups = "drop") %>%
tidyr::drop_na() %>%
dplyr::left_join(
dplyr::select(raw_data$mss_63station_pos, Station:`Long. (E)`),
by = "Station"
) %>%
sf::st_as_sf(coords = c("Long. (E)", "Lat.(N)")) %>%
sf::`st_crs<-`("WGS84")
data[["planning_areas"]] <- raw_data$planning_areas %>%
dplyr::bind_cols(
.$Description %>%
lapply(transpose_html_table) %>%
dplyr::bind_rows()
) %>%
tibble::as_tibble() %>%
dplyr::rename_all(tolower) %>%
dplyr::rename(plan_area = pln_area_n,
pop = total) %>%
dplyr::mutate(plan_area = tools::toTitleCase(tolower(plan_area)),
dplyr::across(pop:others, as.numeric)) %>%
dplyr::select(-matches("^(name|description|inc|fmel)")) %>%
sf::st_as_sf()
data[["regions"]] <- raw_data$regions %>%
dplyr::mutate(Name = tools::toTitleCase(tolower(Name)),
Name = sub(" Region", "", Name)) %>%
dplyr::rename(region = Name) %>%
dplyr::select(-Description)
# This is long
data_space <- data$planning_areas %>%
dplyr::select(-one_to_two_rm:-five_rm_exec_flats, -others) %>%
# Regions
dplyr::left_join(
# Changi Bay and Western Islands not assigned, but inconsequential, for now
find_superset(
sub = .,
super = data$regions,
super_label = "region",
sub_label = "plan_area"
),
by = "plan_area"
) %>%
# Counts
dplyr::left_join(
list(
ncases = data$dengue_points,
nhabs = data$aedes_points,
nclinics = data$clinic_points
) %>% {
lapply(names(.), function(name) {
find_superset(
sub = .[[name]],
super = data$planning_areas,
super_label = "plan_area"
) %>%
dplyr::count(plan_area, name = name)
})
} %>%
Reduce(function(x, y) dplyr::left_join(x, y, by = "plan_area"), .)
) %>%
dplyr::select(plan_area, region, ncases, nhabs, nclinics, everything()) %>%
# Meteorological data
dplyr::left_join(
idw_interpolation(
points = data$weather_points,
points_label = "Station",
polys = .,
polys_label = "plan_area",
ordinal = 10
),
by = "plan_area"
) %>%
# Calculations
dplyr::mutate(
# Combine East with North-East to increase N for comparison
region = ifelse(region == "East", "North-East", region),
area_km2 = units::set_units(sf::st_area(.), km^2),
ncases_p100k = ncases / pop * 100000,
nclinics_p100k = nclinics / pop * 100000,
hdb_pp = hdb / pop,
condos_other_apts_pp = condos_other_apts / pop,
landed_properties_pp = landed_properties / pop,
pop_pkm2 = as.numeric(pop / area_km2),
ncases_pkm2 = as.numeric(ncases / area_km2),
nclinics_pkm2 = as.numeric(nclinics / area_km2),
nhabs_pkm2 = as.numeric(nhabs / area_km2),
label = paste0(
"<b>", plan_area, "</b><br/>",
"Area (km<sup>2</sup>): ", round(area_km2, 2), "<br/>",
"Population: ", pop, "<br/>",
"Cases: ", ncases, "<br/>",
"Clinics: ", nclinics, "<br/>",
"<i>Aedes</i> habitats: ", nhabs, "<br/>",
"Cases (per 100k): ", round(ncases_p100k, 2), "<br/>",
"Cases (per km<sup>2</sup>): ", round(ncases_pkm2, 2)
)
)
data_space
dplyr::glimpse(data_space)
data_space contains 14 main and 9 derived features for each planning area (URA MP14), including the number of dengue cases, the number of Aedes mosquito habitats, the number of clinics, population, various dwelling types, area, and meteorological variables. Dengue case data were unavailable for several planning areas, notably because the data for the north-west region were not published.
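The meteorological columns in data_space come from the project's idw_interpolation() helper (defined earlier), which spreads station readings across planning areas by inverse-distance weighting. The core idea, sketched in base R with hypothetical toy coordinates — an illustration of the technique, not the actual helper:

```r
# Inverse-distance weighting: a target location's value is the weighted
# average of the station values, with weights 1/d^p so nearer stations dominate.
idw <- function(target, stations, values, p = 2) {
  d <- sqrt(rowSums(t(t(stations) - target)^2))     # Euclidean distances
  if (any(d == 0)) return(values[which(d == 0)[1]]) # exact station hit
  w <- 1 / d^p
  sum(w * values) / sum(w)
}

stations <- rbind(c(0, 0), c(1, 0), c(0, 1))  # toy station coordinates
temps <- c(27.0, 28.0, 28.5)                  # toy median temperatures
idw(c(0.2, 0.2), stations, temps)  # pulled toward the nearest station's 27.0
```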
plot_heatmap <- function(.data, y, low_color, high_color, subtitle = "") {
.data %>%
dplyr::mutate(
month = factor(
lubridate::month(End),
levels = as.character(1:12),
labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),
ordered = TRUE
)
) %>%
dplyr::select(Epiyear, month, {{ y }}) %>%
tidyr::pivot_longer({{ y }}) %>%
dplyr::group_by(Epiyear, month) %>%
dplyr::summarise(Cases = sum(value),
.groups = "drop") %>%
ggplot(aes(x = month, y = Cases, fill = Cases)) +
geom_col() +
scale_fill_gradient(low = low_color, high = high_color) +
facet_grid(Epiyear ~ .) +
labs(
title = paste0("Number of ", y, " Cases from 2012-2020"),
subtitle = subtitle,
x = "Month",
y = "Cases",
caption = "Source: Ministry of Health"
) +
theme(legend.position = "none")
}
plot_choropleths <- function(.data, .vars) {
gridExtra::grid.arrange(grobs = lapply(.vars, function(var) {
ggplot(aes(fill = .data[[var]]), data = .data) +
geom_sf() +
scale_fill_viridis_c(guide = guide_colourbar(
title.position = "top",
title.hjust = .5,
barwidth = 10,
barheight = .5
)) +
theme_void() +
theme(legend.position = "bottom")
}), ncol = length(.vars))
}
plot_ggbetweenstats <- function(data, x, y, xlab, ylab) {
# Use the `data` argument, not the global `data_space`, for the normality check
type = ifelse(shapiro.test(data[[y]])$p.value < 0.05, "np", "p")
ggstatsplot::ggbetweenstats(
data = data,
x = {{ x }},
y = {{ y }},
xlab = {{ xlab }},
ylab = {{ ylab }},
type = type,
plot.type = "box",
effsize.type = "biased",
nboot = 50,
pairwise.comparisons = T,
pairwise.display = "significant",
pairwise.annotation = "p.value",
p.adjust.method = "bonferroni"
)
}
animate_choropleth <- function(raw_data, .var = "Med_temp", .title = ".") {
weather_nested = raw_data$mss_19stations %>%
dplyr::mutate(Year = lubridate::year(Date),
Month = lubridate::month(Date)) %>%
dplyr::group_by(Year, Month, Station) %>%
dplyr::summarise(Mean_rainfall = mean(Rainfall, na.rm = T),
Med_temp = median(Mean_temp, na.rm = T),
Med_temp_rng = median(Temp_range, na.rm = T),
.groups = "drop") %>%
tidyr::drop_na() %>%
dplyr::left_join(
dplyr::select(raw_data$mss_63station_pos, Station:`Long. (E)`),
by = "Station"
) %>%
sf::st_as_sf(coords = c("Long. (E)", "Lat.(N)")) %>%
sf::`st_crs<-`("WGS84") %>%
dplyr::mutate(Date = lubridate::ymd(paste(Year, Month, "01", sep = "-"))) %>%
dplyr::select(Date, everything()) %>%
dplyr::group_by(Year, Month) %>%
tidyr::nest()
pa_polys = raw_data$planning_areas %>%
dplyr::select(-Name) %>%
dplyr::mutate(plan_area = sub(".*d>(.*?)<.*", "\\1", Description),
plan_area = tools::toTitleCase(tolower(plan_area)),
.keep = "unused")
result = weather_nested$data %>%
lapply(function(x) {
df = idw_interpolation(
points = x,
points_label = c("Station", "Date"),
polys = pa_polys,
polys_label = "plan_area",
ordinal = 15
)
df$Date = x$Date[1]
df
}) %>%
dplyr::bind_rows() %>%
dplyr::left_join(pa_polys, by = "plan_area") %>%
sf::st_as_sf()
result %>%
ggplot(aes(fill = .data[[.var]])) +
geom_sf() +
scale_fill_viridis_c(guide = guide_colourbar(
title.position = "top",
title.hjust = .5,
barwidth = 20,
barheight = .4
)) +
labs(title = .title,
subtitle = "Date: {frame_time}") +
theme_void() +
theme(legend.position = "bottom") +
gganimate::transition_time(Date) +
gganimate::ease_aes("linear")
}
plot_heatmap(
data_time,
y = "URTI",
low_color = "seagreen2",
high_color = "honeydew4"
)
plot_heatmap(
data_time,
y = "Diarrhoea",
low_color = "lightyellow",
high_color = "tomato4"
)
In general, the numbers of URTI and diarrhoea cases are consistent throughout the years.
plot_heatmap(
data_time,
y = "HFMD",
low_color = "lightskyblue",
high_color = "darkblue"
)
HFMD saw a sharp decrease from 2019.
plot_heatmap(
data_time,
y = "Dengue",
low_color = "ivory3",
high_color = "deeppink4",
subtitle = "Highest number of Cases in 2020"
)
DF fluctuated over the years: we observed a dip in the number of dengue cases in 2017 and a sudden spike in 2020.
It is worth noting that there seems to be a seasonal trend for DF, which peaks in Jul–Aug. The decrease in dengue cases in 2017 could be attributed to several factors, such as increased efforts by the NEA and the community in response to the Zika outbreak in 2016, and perhaps immunity built up in the local population after past dengue outbreaks. The spike in 2020 could be explained by the prolonged period spent at home and the lack of resistance to a newly circulating strain of the dengue virus.
plot_choropleths(data_space, c("ncases", "ncases_p100k", "ncases_pkm2"))
From the choropleth maps, we observe that the North-East region of Singapore experiences a higher number of dengue cases than other parts of the country, with Geylang having the highest count at 372 cases.
plot_choropleths(data_space, c("nhabs", "nhabs_pkm2"))
This pattern corresponds with the distribution of breeding habitats, with a higher number of habitats found in the North-East region.
plot_choropleths(data_space, c("mean_rainfall", "med_temp", "med_temp_rng"))
Generally, temperature and rainfall are quite homogeneous across Singapore, with Geylang and Hougang having slightly higher temperatures than other areas; the south of the island also seems to receive more rainfall than other regions.
# Very expensive process, so we'll do just one
animate_choropleth(
raw_data,
.var = "Med_temp",
.title = "Median temperatures in planning areas (2012-2020)"
)
We compare regions using our plot_ggbetweenstats() wrapper around ggstatsplot::ggbetweenstats().
plot_ggbetweenstats(
data = data_space,
x = "region",
y = "ncases",
xlab = "Regions in Singapore",
ylab = "Number of cases"
)
There is a significant difference in dengue cases in Singapore in the last 14 days (from 2020-07-28) between the North East (\(\hat{\mu}\) = 127) and West (\(\hat{\mu}\) = 25.6) regions of Singapore (p = 0.045).
plot_ggbetweenstats(
data = data_space,
x = "region",
y = "ncases_p100k",
xlab = "Regions in Singapore",
ylab = "Number of cases (per 100,000)"
)
There is a significant difference in dengue case incidence rates (per 100,000 population) in Singapore in the last 14 days (from 2020-07-28) between the Central (\(\hat{\mu}\) = 209.53) and West (\(\hat{\mu}\) = 16.64) regions of Singapore (p = 0.002).
plot_ggbetweenstats(
data = data_space,
x = "region",
y = "med_temp",
xlab = "Regions in Singapore",
ylab = "Median temperature"
)
There are significant differences in the median temperatures in the past month between the Central (\(\hat{\mu}\) = 28.45) and North (\(\hat{\mu}\) = 27.56) regions (p = 0.017), as well as between the North and North East (\(\hat{\mu}\) = 28.49) regions (p = 0.009).
plot_ggbetweenstats(
data = data_space,
x = "region",
y = "med_temp_rng",
xlab = "Regions in Singapore",
ylab = "Median temperature range"
)
Similar differences were present for the median temperature ranges, with differences between the Central (\(\hat{\mu}\) = 5.28) and North (\(\hat{\mu}\) = 5.93) regions (p <= 0.001), as well as between the North and North East (\(\hat{\mu}\) = 5.15) regions (p = 0.001).
data_space %>%
tibble::as_tibble() %>%
dplyr::select(ncases, nhabs, hdb:landed_properties) %>%
tidyr::drop_na() %>%
as.matrix() %>%
Hmisc::rcorr() %>%
broom::tidy() %>%
dplyr::filter(p.value < 0.05) %>%
dplyr::arrange(estimate)
# A tibble: 7 x 5
column1 column2 estimate n p.value
<chr> <chr> <dbl> <int> <dbl>
1 condos_other_apts ncases 0.374 28 0.0498
2 condos_other_apts hdb 0.404 28 0.0329
3 condos_other_apts nhabs 0.420 28 0.0261
4 landed_properties ncases 0.574 28 0.00139
5 landed_properties nhabs 0.646 28 0.000204
6 landed_properties condos_other_apts 0.710 28 0.0000236
7 nhabs ncases 0.839 28 0.0000000239
data_space %>%
tibble::as_tibble() %>%
dplyr::select(ncases, nhabs, mean_rainfall, med_temp, med_temp_rng) %>%
tidyr::drop_na() %>%
as.matrix() %>%
Hmisc::rcorr() %>%
broom::tidy() %>%
dplyr::filter(p.value < 0.05) %>%
dplyr::arrange(estimate)
# A tibble: 5 x 5
column1 column2 estimate n p.value
<chr> <chr> <dbl> <int> <dbl>
1 med_temp_rng med_temp -0.618 28 0.000458
2 med_temp_rng nhabs -0.479 28 0.00989
3 med_temp nhabs 0.501 28 0.00663
4 med_temp ncases 0.624 28 0.000384
5 nhabs ncases 0.839 28 0.0000000239
data_space %>%
leaflet::leaflet(width = "100%") %>%
leaflet::addTiles() %>%
leaflet::addPolygons(
weight = 1,
opacity = 1,
fillOpacity = 0.7,
smoothFactor = 0.5,
fillColor = ~leaflet::colorNumeric("Reds", ncases_pkm2)(ncases_pkm2),
label = ~lapply(label, htmltools::HTML),
popup = ~lapply(label, htmltools::HTML)
) %>%
leaflet::addCircleMarkers(
data = data$dengue_points,
color = "red",
radius = 5,
fillOpacity = 0.5,
clusterOptions = leaflet::markerClusterOptions()
) %>%
leaflet::addLabelOnlyMarkers(
data = sf::st_centroid(data_space),
label = ~plan_area,
labelOptions = leaflet::labelOptions(
noHide = T,
textOnly = T,
direction = "center",
style = list("color" = "blue"))
)
# Store models in this list
models <- list()
Let’s plot the x variables (rainfall, temperature, and temperature-range measures) and the y variable (Dengue)
{par(mfrow = c(3, 2))
hist(data_time$Dengue, breaks=40)
hist(data_time$Mean_rainfall, breaks=40)
hist(data_time$Mean_temp, breaks=40)
hist(data_time$Med_temp, breaks=40)
hist(data_time$Mean_temp_rng, breaks=40)
hist(data_time$Med_temp_rng, breaks=40)
}
ks.test(data_time$Mean_temp, data_time$Med_temp)
Two-sample Kolmogorov-Smirnov test
data: data_time$Mean_temp and data_time$Med_temp
D = 0, p-value = 1
alternative hypothesis: two-sided
ks.test(data_time$Mean_temp_rng, data_time$Med_temp_rng)
Two-sample Kolmogorov-Smirnov test
data: data_time$Mean_temp_rng and data_time$Med_temp_rng
D = 0.065315, p-value = 0.2999
alternative hypothesis: two-sided
Mean_temp and Med_temp have similar distributions, and Mean_temp_rng and Med_temp_rng have similar distributions, therefore Med_temp and Med_temp_rng were used.
Let’s run a correlation analysis to see the relationships between the variables
data_time %>%
dplyr::select(Dengue, Mean_rainfall, Med_temp, Med_temp_rng) %>%
as.matrix() %>%
Hmisc::rcorr() %>%
broom::tidy() %>%
# dplyr::filter(p.value < 0.05) %>%
dplyr::mutate(dplyr::across(where(is.numeric), ~round(.x, 3))) %>%
dplyr::arrange(estimate)
# A tibble: 6 x 5
column1 column2 estimate n p.value
<chr> <chr> <dbl> <dbl> <dbl>
1 Med_temp Mean_rainfall -0.499 444 0
2 Med_temp_rng Dengue -0.151 444 0.001
3 Mean_rainfall Dengue -0.052 444 0.278
4 Med_temp_rng Med_temp 0.031 444 0.51
5 Med_temp_rng Mean_rainfall 0.084 444 0.077
6 Med_temp Dengue 0.201 444 0
Run a linear regression model
models[["model1_dengue"]] <-
lm(Dengue ~ Mean_rainfall + Med_temp + Med_temp_rng, data = data_time)
broom::tidy(models$model1_dengue)
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -1237. 378. -3.27 0.00114
2 Mean_rainfall 4.01 2.46 1.63 0.104
3 Med_temp 61.9 13.2 4.70 0.00000355
4 Med_temp_rng -43.3 12.1 -3.59 0.000369
Let’s check the skewness of the y variable
e1071::skewness(data_time$Dengue) # [1] 2.033787
[1] 2.033787
Let’s see how the density plots look after transformation
{
par(mfrow = c(2, 2))
plot(density(data_time$Dengue, na.rm = T), main = "untransformed")
plot(density(sqrt(data_time$Dengue), na.rm = T), main = "sqrt")
plot(density(log10(data_time$Dengue), na.rm = T), main = "log10")
plot(density(1 / data_time$Dengue, na.rm = T), main = "inverse")
}
None of them look really good. Let’s try the transformations on Model 1 anyway.
models[["model1_log_dengue"]] <-
lm(log10(Dengue) ~ Mean_rainfall + Med_temp + Med_temp_rng, data = data_time)
broom::tidy(models$model1_log_dengue)
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.309 0.700 -0.441 0.659
2 Mean_rainfall 0.00294 0.00456 0.644 0.520
3 Med_temp 0.0993 0.0244 4.06 0.0000572
4 Med_temp_rng -0.0407 0.0224 -1.82 0.0695
gvlma::gvlma(models$model1_log_dengue)
Call:
lm(formula = log10(Dengue) ~ Mean_rainfall + Med_temp + Med_temp_rng,
data = data_time)
Coefficients:
(Intercept) Mean_rainfall Med_temp Med_temp_rng
-0.309087 0.002939 0.099330 -0.040711
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = models$model1_log_dengue)
Value p-value Decision
Global Stat 28.1144 1.182e-05 Assumptions NOT satisfied!
Skewness 1.9927 1.581e-01 Assumptions acceptable.
Kurtosis 17.8056 2.447e-05 Assumptions NOT satisfied!
Link Function 0.1611 6.882e-01 Assumptions acceptable.
Heteroscedasticity 8.1550 4.294e-03 Assumptions NOT satisfied!
models[["model1_sqrt_dengue"]] <-
lm(sqrt(Dengue) ~ Mean_rainfall + Med_temp + Med_temp_rng, data = data_time)
broom::tidy(models$model1_sqrt_dengue)
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -28.7 11.2 -2.57 0.0105
2 Mean_rainfall 0.0782 0.0727 1.08 0.283
3 Med_temp 1.74 0.390 4.47 0.00000986
4 Med_temp_rng -0.974 0.357 -2.73 0.00655
gvlma::gvlma(models$model1_sqrt_dengue)
Call:
lm(formula = sqrt(Dengue) ~ Mean_rainfall + Med_temp + Med_temp_rng,
data = data_time)
Coefficients:
(Intercept) Mean_rainfall Med_temp Med_temp_rng
-28.71254 0.07821 1.74274 -0.97433
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = models$model1_sqrt_dengue)
Value p-value Decision
Global Stat 32.7985 1.313e-06 Assumptions NOT satisfied!
Skewness 22.7656 1.830e-06 Assumptions NOT satisfied!
Kurtosis 0.2636 6.077e-01 Assumptions acceptable.
Link Function 0.5967 4.398e-01 Assumptions acceptable.
Heteroscedasticity 9.1726 2.457e-03 Assumptions NOT satisfied!
models[["model1_inv_dengue"]] <-
lm((1 / Dengue) ~ Mean_rainfall + Med_temp + Med_temp_rng, data = data_time)
broom::tidy(models$model1_inv_dengue)
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0529 0.0141 3.76 0.000193
2 Mean_rainfall -0.0000488 0.0000916 -0.533 0.594
3 Med_temp -0.00164 0.000491 -3.34 0.000896
4 Med_temp_rng 0.000310 0.000449 0.689 0.491
gvlma::gvlma(models$model1_inv_dengue)
Call:
lm(formula = (1/Dengue) ~ Mean_rainfall + Med_temp + Med_temp_rng,
data = data_time)
Coefficients:
(Intercept) Mean_rainfall Med_temp Med_temp_rng
0.0528561 -0.0000488 -0.0016408 0.0003096
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = models$model1_inv_dengue)
Value p-value Decision
Global Stat 200.447 0.000e+00 Assumptions NOT satisfied!
Skewness 133.155 0.000e+00 Assumptions NOT satisfied!
Kurtosis 41.572 1.136e-10 Assumptions NOT satisfied!
Link Function 1.142 2.853e-01 Assumptions acceptable.
Heteroscedasticity 24.578 7.135e-07 Assumptions NOT satisfied!
Let’s have a look at the models’ adjusted R2 values
list(
"untransformed" = models$model1_dengue,
"log" = models$model1_log_dengue,
"sqrt" = models$model1_sqrt_dengue,
"inv" = models$model1_inv_dengue
) %>%
lapply(broom::glance) %>%
{ dplyr::bind_cols(names(.), dplyr::bind_rows(.)) } %>%
dplyr::rename(transformation = ...1)
New names:
* NA -> ...1
# A tibble: 4 x 13
transformation r.squared adj.r.squared sigma statistic p.value df logLik
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 untransformed 0.0709 0.0646 2.02e+2 11.2 4.34e-7 3 -2985.
2 log 0.0472 0.0407 3.75e-1 7.26 9.19e-5 3 -193.
3 sqrt 0.0599 0.0535 5.98e+0 9.34 5.34e-6 3 -1422.
4 inv 0.0292 0.0226 7.53e-3 4.41 4.53e-3 3 1543.
# ... with 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
It seems that models$model1_dengue (untransformed) has a higher adjusted R2 value than the other models.
Plot the regression models
data_time %>%
dplyr::select(Dengue, Med_temp, Med_temp_rng, Mean_rainfall) %>%
# scale() %>%
tibble::as_tibble() %>%
tidyr::pivot_longer(-Dengue) %>%
ggplot(aes(x = value, y = Dengue, color = name)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", formula = y ~ x, color = "blue") +
facet_grid(. ~ name, scales = "free") +
theme(legend.position = "none")
Since rainfall has marginal significance at an alpha of 10% and is known to affect vector-borne diseases, we decided to explore the rainfall variable further by categorizing it (< 11 mm vs >= 11 mm of mean weekly rainfall).
data_time <- data_time %>%
dplyr::mutate(Rain_11 = ifelse(Mean_rainfall >= 11, 1, 0))
dplyr::glimpse(data_time)
Rows: 444
Columns: 15
$ Epiyear <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012,...
$ Epiweek <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
$ Start <dttm> 2012-01-01, 2012-01-08, 2012-01-15, 2012-01-22, 2012...
$ End <dttm> 2012-01-07, 2012-01-14, 2012-01-21, 2012-01-28, 2012...
$ Dengue <dbl> 74, 64, 60, 50, 84, 87, 65, 50, 55, 45, 64, 72, 48, 5...
$ HFMD <dbl> 326, 346, 463, 372, 444, 683, 810, 989, 1115, 1139, 1...
$ URTI <dbl> 2932.222, 3188.727, 3184.545, 4000.571, 3355.818, 335...
$ Diarrhoea <dbl> 490.8889, 575.0909, 538.9091, 614.5714, 558.5455, 556...
$ Mean_rainfall <dbl> 4.939850e-01, 4.386466e+00, 9.695489e+00, 4.429323e+0...
$ Med_rainfall <dbl> 0.0, 1.7, 0.2, 0.0, 0.0, 0.0, 1.2, 0.0, 1.4, 1.4, 0.3...
$ Mean_temp <dbl> 27.35469, 26.09847, 27.29297, 26.66406, 26.49699, 27....
$ Med_temp <dbl> 27.35469, 26.09847, 27.29297, 26.66406, 26.49699, 27....
$ Mean_temp_rng <dbl> 6.498496, 5.030827, 7.760150, 6.642105, 6.358647, 7.5...
$ Med_temp_rng <dbl> 6.4, 4.5, 7.7, 6.8, 6.4, 7.5, 6.5, 6.7, 6.8, 6.7, 6.5...
$ Rain_11 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,...
Re-run Model 1 with the categorized rain variable
models[["model2_dengue"]] <-
lm(Dengue ~ Rain_11 + Med_temp + Med_temp_rng, data = data_time)
gvlma::gvlma(models$model2_dengue)
Call:
lm(formula = Dengue ~ Rain_11 + Med_temp + Med_temp_rng, data = data_time)
Coefficients:
(Intercept) Rain_11 Med_temp Med_temp_rng
-1219.75 62.59 61.56 -42.21
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = models$model2_dengue)
Value p-value Decision
Global Stat 803.970 0.000e+00 Assumptions NOT satisfied!
Skewness 218.166 0.000e+00 Assumptions NOT satisfied!
Kurtosis 556.830 0.000e+00 Assumptions NOT satisfied!
Link Function 2.482 1.151e-01 Assumptions acceptable.
Heteroscedasticity 26.493 2.645e-07 Assumptions NOT satisfied!
broom::tidy(models$model2_dengue)
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -1220. 351. -3.48 0.000556
2 Rain_11 62.6 28.1 2.22 0.0266
3 Med_temp 61.6 12.3 5.01 0.000000805
4 Med_temp_rng -42.2 12.0 -3.53 0.000463
Add interaction terms
models[["model2_dengue_int"]] <-
lm(Dengue ~ (Med_temp + Med_temp_rng) * Rain_11, data = data_time)
gvlma::gvlma(models$model2_dengue_int)
Call:
lm(formula = Dengue ~ (Med_temp + Med_temp_rng) * Rain_11, data = data_time)
Coefficients:
(Intercept) Med_temp Med_temp_rng
-1187.07 56.90 -26.90
Rain_11 Med_temp:Rain_11 Med_temp_rng:Rain_11
-1861.85 97.09 -111.60
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = models$model2_dengue_int)
Value p-value Decision
Global Stat 513.22 0.000e+00 Assumptions NOT satisfied!
Skewness 166.77 0.000e+00 Assumptions NOT satisfied!
Kurtosis 310.38 0.000e+00 Assumptions NOT satisfied!
Link Function 16.41 5.092e-05 Assumptions NOT satisfied!
Heteroscedasticity 19.65 9.293e-06 Assumptions NOT satisfied!
broom::tidy(models$model2_dengue_int)
# A tibble: 6 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -1187. 382. -3.10 0.00204
2 Med_temp 56.9 13.1 4.34 0.0000179
3 Med_temp_rng -26.9 13.3 -2.02 0.0435
4 Rain_11 -1862. 1006. -1.85 0.0649
5 Med_temp:Rain_11 97.1 39.4 2.46 0.0142
6 Med_temp_rng:Rain_11 -112. 32.7 -3.41 0.000702
anova(models$model2_dengue, models$model2_dengue_int)
Analysis of Variance Table
Model 1: Dengue ~ Rain_11 + Med_temp + Med_temp_rng
Model 2: Dengue ~ (Med_temp + Med_temp_rng) * Rain_11
Res.Df RSS Df Sum of Sq F Pr(>F)
1 440 17909105
2 438 17396320 2 512786 6.4554 0.001726 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
broom::glance(models$model2_dengue)
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.0757 0.0694 202. 12.0 1.44e-7 3 -2984. 5979. 5999.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom::glance(models$model2_dengue_int)
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.102 0.0919 199. 9.97 4.82e-9 5 -2978. 5970. 5998.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
There is an interaction effect (Rain_11 with the two other predictors), and the ANOVA test shows that adding the interaction terms improves the model fit (p = 0.0017). The interaction model also has a higher R2 value:
broom::glance(models$model2_dengue)$r.squared          # 0.0757
broom::glance(models$model2_dengue_int)$r.squared      # 0.102
broom::glance(models$model2_dengue)$adj.r.squared      # 0.0694
broom::glance(models$model2_dengue_int)$adj.r.squared  # 0.0919
Unfortunately, all the model assumption checks failed, so this model won’t be a good one for prediction.
autoplot(models$model2_dengue_int)
Let’s explore this interaction model
jtools::export_summs(
models$model2_dengue,
models$model2_dengue_int,
error_format = "(p = {p.value})",
model.names = c("Main Effects Only", "Interaction Effects"),
digits = 3
)
|                      | Main Effects Only | Interaction Effects |
|----------------------|-------------------|---------------------|
| (Intercept)          | -1219.750 ***     | -1187.070 **        |
|                      | (p = 0.001)       | (p = 0.002)         |
| Rain_11              | 62.587 *          | -1861.847           |
|                      | (p = 0.027)       | (p = 0.065)         |
| Med_temp             | 61.565 ***        | 56.900 ***          |
|                      | (p = 0.000)       | (p = 0.000)         |
| Med_temp_rng         | -42.213 ***       | -26.895 *           |
|                      | (p = 0.000)       | (p = 0.044)         |
| Med_temp:Rain_11     |                   | 97.086 *            |
|                      |                   | (p = 0.014)         |
| Med_temp_rng:Rain_11 |                   | -111.602 ***        |
|                      |                   | (p = 0.001)         |
| N                    | 444               | 444                 |
| R2                   | 0.076             | 0.102               |

*** p < 0.001; ** p < 0.01; * p < 0.05.
Probing Interaction Effects
Run a spotlight analysis using the Johnson-Neyman technique
Median temperature
interactions::sim_slopes(
models$model2_dengue_int,
pred = Rain_11,
modx = Med_temp,
johnson_neyman = T
) # [23.43, 27.15]
JOHNSON-NEYMAN INTERVAL
When Med_temp is OUTSIDE the interval [23.43, 27.15], the slope of Rain_11
is p < .05.
Note: The range of observed values of Med_temp is [25.06, 29.96]
SIMPLE SLOPES ANALYSIS
Slope of Rain_11 when Med_temp = 27.11 (- 1 SD):
Est. S.E. t val. p
------- ------- -------- ------
52.49 28.96 1.81 0.07
Slope of Rain_11 when Med_temp = 27.96 (Mean):
Est. S.E. t val. p
-------- ------- -------- ------
134.46 38.04 3.53 0.00
Slope of Rain_11 when Med_temp = 28.80 (+ 1 SD):
Est. S.E. t val. p
-------- ------- -------- ------
216.42 65.35 3.31 0.00
interactions::interact_plot(
models$model2_dengue_int,
pred="Med_temp",
modx = "Rain_11",
modx.labels = c("less than 11mm of rainfall", "rainfall >= 11mm"),
interval = T,
int.width = .95,
colors = c("deeppink", "darkgreen"),
vary.lty = T,
line.thickness = 1,
legend.main = ""
) +
geom_vline(xintercept = 27.15, col = "red", linetype = 3, size = 1) +
labs(
title = "Effect of Rainfall and Temperature on Dengue Cases in Singapore",
subtitle = paste0(
"With an average weekly rainfall of 11mm and above, the number of \n",
"Dengue cases increases with rise in temperature"
),
x = "Weekly Median Temperature",
y = "Number of Dengue Cases",
caption = "Source: MOH, NEA Websites"
) +
annotate(
"text",
x = 28,
y = 0,
label = paste(
"The shaded areas denote 95% confidence intervals.",
"The vertical line marks the boundary between",
"regions of significance and non-significance",
"based on alpha at 5%",
sep = "\n"
)
) +
theme(legend.position = "top")
Median Temperature Range
interactions::sim_slopes(
models$model2_dengue_int,
pred = Rain_11,
modx = Med_temp_rng,
johnson_neyman = T
) # [7.01, 9.04]
JOHNSON-NEYMAN INTERVAL
When Med_temp_rng is OUTSIDE the interval [7.01, 9.04], the slope of
Rain_11 is p < .05.
Note: The range of observed values of Med_temp_rng is [3.50, 8.50]
SIMPLE SLOPES ANALYSIS
Slope of Rain_11 when Med_temp_rng = 5.63 (- 1 SD):
Est. S.E. t val. p
-------- ------- -------- ------
223.99 53.48 4.19 0.00
Slope of Rain_11 when Med_temp_rng = 6.43 (Mean):
Est. S.E. t val. p
-------- ------- -------- ------
134.46 38.04 3.53 0.00
Slope of Rain_11 when Med_temp_rng = 7.24 (+ 1 SD):
Est. S.E. t val. p
------- ------- -------- ------
44.92 37.56 1.20 0.23
interactions::interact_plot(
models$model2_dengue_int,
pred = "Med_temp_rng",
modx = "Rain_11",
modx.labels = c("less than 11mm of rainfall", "rainfall >= 11mm"),
interval = T,
int.width = .95,
colors = c("deeppink", "darkgreen"),
vary.lty = T,
line.thickness = 1,
legend.main = "Rainfall"
) +
geom_vline(xintercept = 7.01, col = "red", linetype = 3, size = 1) +
labs(
title = paste0(
"Effect of Rainfall and Temperature Variation ",
"on Dengue Cases in Singapore"
),
subtitle = paste0(
"With an average weekly rainfall of 11mm and above, the number of \n",
"Dengue cases decreases when temperature variation is HIGH"
),
x = "Weekly Median Temperature Range",
y = "Number of Dengue Cases",
caption = "Source: MOH, NEA Websites"
) +
annotate(
"text",
x = 6,
y = 100,
label = paste(
"The shaded areas denote 95% confidence intervals.",
"The vertical line marks the boundary between",
"regions of significance and non-significance",
"based on alpha at 5%",
sep = "\n"
)
) +
theme(legend.position = "top")
Since the number of URTI cases in Singapore dropped drastically due to COVID-19, we excluded the values from April 2020 onwards (the circuit-breaker period) from this analysis.
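As a sketch of what a purely date-based exclusion might look like (the cutoff date below is an assumption based on the circuit-breaker start; the chunk that follows uses a case-count threshold instead):

```r
# Illustrative sketch with a toy data frame; in the report the same idea
# would apply to data_time, whose Start column is a date-time.
# The cutoff date (2020-04-01, start of the circuit breaker) is an assumption.
toy <- data.frame(
  Start = as.Date(c("2020-03-01", "2020-03-29", "2020-04-05")),
  URTI  = c(2500, 2100, 900)
)
kept <- subset(toy, Start < as.Date("2020-04-01"))  # drop weeks from April 2020 on
nrow(kept)  # 2 weeks remain
```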
Let’s plot the x variables (Mean_rainfall, Med_temp, Med_temp_rng) and the y variable (URTI)
data_time_URTI <- data_time %>%
dplyr::filter(URTI >= 2000)
{
par(mfrow = c(2, 2))
hist(data_time_URTI$Mean_rainfall, breaks=40)
hist(data_time_URTI$Med_temp, breaks=40)
hist(data_time_URTI$Med_temp_rng, breaks=40)
hist(data_time_URTI$URTI, breaks=40)
}
Let’s run pairwise correlations to see the relationships between the variables
data_time_URTI %>%
dplyr::select(URTI, Mean_rainfall, Med_temp, Med_temp_rng) %>%
as.matrix() %>%
Hmisc::rcorr() %>%
broom::tidy() %>%
# dplyr::filter(p.value < 0.05) %>%
dplyr::mutate(dplyr::across(where(is.numeric), ~round(., 3))) %>%
dplyr::arrange(estimate)
# A tibble: 6 x 5
column1 column2 estimate n p.value
<chr> <chr> <dbl> <dbl> <dbl>
1 Med_temp Mean_rainfall -0.508 429 0
2 Med_temp URTI -0.195 429 0
3 Med_temp_rng URTI -0.086 429 0.075
4 Mean_rainfall URTI -0.042 429 0.383
5 Med_temp_rng Med_temp 0.027 429 0.577
6 Med_temp_rng Mean_rainfall 0.091 429 0.06
Fit a multiple linear regression model
models[["model1_URTI"]] <-
lm(URTI ~ Mean_rainfall + Med_temp + Med_temp_rng, data = data_time_URTI)
broom::tidy(models$model1_URTI)
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 6804. 729. 9.34 5.59e-19
2 Mean_rainfall -15.7 4.74 -3.31 1.01e- 3
3 Med_temp -133. 25.5 -5.22 2.75e- 7
4 Med_temp_rng -30.5 23.2 -1.31 1.91e- 1
gvlma::gvlma(models$model1_URTI)
Call:
lm(formula = URTI ~ Mean_rainfall + Med_temp + Med_temp_rng,
data = data_time_URTI)
Coefficients:
(Intercept) Mean_rainfall Med_temp Med_temp_rng
6804.45 -15.68 -132.97 -30.45
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = models$model1_URTI)
Value p-value Decision
Global Stat 119.874 0.000e+00 Assumptions NOT satisfied!
Skewness 48.234 3.782e-12 Assumptions NOT satisfied!
Kurtosis 48.947 2.630e-12 Assumptions NOT satisfied!
Link Function 3.667 5.550e-02 Assumptions acceptable.
Heteroscedasticity 19.026 1.290e-05 Assumptions NOT satisfied!
The assumption checks failed; let’s try transforming the response variable.
Let’s check the skewness of the y variable
e1071::skewness(data_time_URTI$URTI) # [1] 0.8634216
[1] 0.8634216
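Before eyeballing densities, the candidate transformations can also be compared numerically. A self-contained sketch with simulated right-skewed data; the helper `skew()` is a hand-rolled sample skewness, standing in for `e1071::skewness()`:

```r
# Hand-rolled sample skewness (third standardized moment),
# standing in for e1071::skewness()
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(1)
y_sim <- rexp(500, rate = 1) * 1000 + 2000  # simulated right-skewed weekly counts

round(c(
  untransformed = skew(y_sim),
  sqrt          = skew(sqrt(y_sim)),
  log10         = skew(log10(y_sim)),
  inverse       = skew(1 / y_sim)
), 2)
```

The transformation whose skewness is closest to zero is the natural candidate; for the URTI series, the inverse turned out to satisfy the gvlma checks below.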
# Let's see how the density plots look after transformation
{
par(mfrow = c(2, 2))
plot(density(data_time_URTI$URTI), main = "untransformed")
plot(density(sqrt(data_time_URTI$URTI)), main = "sqrt")
plot(density(log10(data_time_URTI$URTI)), main = "log10")
plot(density(1 / data_time_URTI$URTI), main = "inverse")
}
Try the transformations on Model 1
models[["model1_log_URTI"]] <- data_time_URTI %>%
lm(log10(URTI) ~ Mean_rainfall + Med_temp + Med_temp_rng, data = .)
broom::tidy(models$model1_log_URTI)
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.03 0.110 36.8 3.38e-134
2 Mean_rainfall -0.00232 0.000713 -3.26 1.22e- 3
3 Med_temp -0.0195 0.00383 -5.09 5.50e- 7
4 Med_temp_rng -0.00523 0.00350 -1.50 1.35e- 1
gvlma::gvlma(models$model1_log_URTI)
Call:
lm(formula = log10(URTI) ~ Mean_rainfall + Med_temp + Med_temp_rng,
data = .)
Coefficients:
(Intercept) Mean_rainfall Med_temp Med_temp_rng
4.034784 -0.002320 -0.019479 -0.005233
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = models$model1_log_URTI)
Value p-value Decision
Global Stat 22.357 0.0001701 Assumptions NOT satisfied!
Skewness 8.284 0.0039991 Assumptions NOT satisfied!
Kurtosis 1.968 0.1606558 Assumptions acceptable.
Link Function 3.159 0.0755032 Assumptions acceptable.
Heteroscedasticity 8.946 0.0027807 Assumptions NOT satisfied!
models[["model1_sqrt_URTI"]] <- data_time_URTI %>%
lm(sqrt(URTI) ~ Mean_rainfall + Med_temp + Med_temp_rng, data = .)
broom::tidy(models$model1_sqrt_URTI)
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 89.6 6.76 13.3 7.25e-34
2 Mean_rainfall -0.144 0.0439 -3.29 1.09e- 3
3 Med_temp -1.22 0.236 -5.16 3.73e- 7
4 Med_temp_rng -0.303 0.215 -1.41 1.60e- 1
gvlma::gvlma(models$model1_sqrt_URTI)
Call:
lm(formula = sqrt(URTI) ~ Mean_rainfall + Med_temp + Med_temp_rng,
data = .)
Coefficients:
(Intercept) Mean_rainfall Med_temp Med_temp_rng
89.6374 -0.1444 -1.2182 -0.3032
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = models$model1_sqrt_URTI)
Value p-value Decision
Global Stat 52.462 1.104e-10 Assumptions NOT satisfied!
Skewness 23.091 1.545e-06 Assumptions NOT satisfied!
Kurtosis 12.591 3.877e-04 Assumptions NOT satisfied!
Link Function 3.415 6.462e-02 Assumptions acceptable.
Heteroscedasticity 13.365 2.563e-04 Assumptions NOT satisfied!
models[["model1_inv_URTI"]] <- data_time_URTI %>%
lm((1 / URTI) ~ Mean_rainfall + Med_temp + Med_temp_rng, data = .)
broom::tidy(models$model1_inv_URTI)
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.000108 0.0000902 -1.20 0.231
2 Mean_rainfall 0.00000185 0.000000586 3.16 0.00168
3 Med_temp 0.0000154 0.00000315 4.89 0.00000143
4 Med_temp_rng 0.00000475 0.00000288 1.65 0.0997
gvlma::gvlma(models$model1_inv_URTI)
Call:
lm(formula = (1/URTI) ~ Mean_rainfall + Med_temp + Med_temp_rng,
data = .)
Coefficients:
(Intercept) Mean_rainfall Med_temp Med_temp_rng
-1.083e-04 1.854e-06 1.541e-05 4.745e-06
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = models$model1_inv_URTI)
Value p-value Decision
Global Stat 6.13667 0.18917 Assumptions acceptable.
Skewness 0.31090 0.57713 Assumptions acceptable.
Kurtosis 0.03011 0.86224 Assumptions acceptable.
Link Function 2.65447 0.10326 Assumptions acceptable.
Heteroscedasticity 3.14119 0.07634 Assumptions acceptable.
All assumptions are acceptable with the inverse transformation, so we proceed with it.
autoplot(models$model1_inv_URTI)
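One caveat when reading the inverse-transformed model: the response is 1/URTI, so a positive coefficient implies fewer cases, and every effect flips sign on the original scale. Predictions can be brought back to case counts by taking the reciprocal. A self-contained sketch with simulated data (all names and values here are illustrative):

```r
# Simulated example: case counts that FALL as temperature rises
set.seed(42)
temp  <- runif(200, 25, 30)
cases <- 1 / (0.0001 + 0.00001 * temp) + rnorm(200, sd = 30)

fit_inv <- lm(I(1 / cases) ~ temp)
coef(fit_inv)[["temp"]]  # positive slope on the inverse scale...

# ...but back-transformed predictions DECREASE with temperature:
pred_counts <- 1 / predict(fit_inv, newdata = data.frame(temp = c(26, 29)))
diff(pred_counts)  # negative
```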
Plot regression Model 1
data_time_URTI %>%
dplyr::mutate(URTI_inv = (1 / URTI)) %>%
dplyr::select(URTI_inv, Med_temp, Med_temp_rng, Mean_rainfall) %>%
# scale() %>%
tibble::as_tibble() %>%
tidyr::pivot_longer(-URTI_inv) %>%
ggplot(aes(x = value, y = URTI_inv, color = name)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", formula = y ~ x, color = "blue") +
facet_grid(. ~ name, scales = "free") +
theme(legend.position = "none")
From the plot, there are more cases when rainfall is lower. So we decided to explore the rainfall variable by dichotomizing it (rainfall < 11mm vs rainfall >= 11mm).
data_time_URTI <- data_time_URTI %>%
dplyr::mutate(Rain_11 = ifelse(Mean_rainfall >= 11, 1, 0))
dplyr::glimpse(data_time_URTI)
Rows: 429
Columns: 15
$ Epiyear <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012,...
$ Epiweek <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
$ Start <dttm> 2012-01-01, 2012-01-08, 2012-01-15, 2012-01-22, 2012...
$ End <dttm> 2012-01-07, 2012-01-14, 2012-01-21, 2012-01-28, 2012...
$ Dengue <dbl> 74, 64, 60, 50, 84, 87, 65, 50, 55, 45, 64, 72, 48, 5...
$ HFMD <dbl> 326, 346, 463, 372, 444, 683, 810, 989, 1115, 1139, 1...
$ URTI <dbl> 2932.222, 3188.727, 3184.545, 4000.571, 3355.818, 335...
$ Diarrhoea <dbl> 490.8889, 575.0909, 538.9091, 614.5714, 558.5455, 556...
$ Mean_rainfall <dbl> 4.939850e-01, 4.386466e+00, 9.695489e+00, 4.429323e+0...
$ Med_rainfall <dbl> 0.0, 1.7, 0.2, 0.0, 0.0, 0.0, 1.2, 0.0, 1.4, 1.4, 0.3...
$ Mean_temp <dbl> 27.35469, 26.09847, 27.29297, 26.66406, 26.49699, 27....
$ Med_temp <dbl> 27.35469, 26.09847, 27.29297, 26.66406, 26.49699, 27....
$ Mean_temp_rng <dbl> 6.498496, 5.030827, 7.760150, 6.642105, 6.358647, 7.5...
$ Med_temp_rng <dbl> 6.4, 4.5, 7.7, 6.8, 6.4, 7.5, 6.5, 6.7, 6.8, 6.7, 6.5...
$ Rain_11 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,...
Re-run Model 1 with the categorized rain variable
models[["model2_inv_URTI"]] <-
lm((1 / URTI) ~ Rain_11 + Med_temp + Med_temp_rng, data = data_time_URTI)
gvlma::gvlma(models$model2_inv_URTI)
Call:
lm(formula = (1/URTI) ~ Rain_11 + Med_temp + Med_temp_rng, data = data_time_URTI)
Coefficients:
(Intercept) Rain_11 Med_temp Med_temp_rng
-3.362e-05 1.536e-05 1.290e-05 5.447e-06
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = models$model2_inv_URTI)
Value p-value Decision
Global Stat 7.11459 0.1300 Assumptions acceptable.
Skewness 0.06411 0.8001 Assumptions acceptable.
Kurtosis 0.01816 0.8928 Assumptions acceptable.
Link Function 3.41266 0.0647 Assumptions acceptable.
Heteroscedasticity 3.61966 0.0571 Assumptions acceptable.
broom::tidy(models$model2_inv_URTI)
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.0000336 0.0000842 -0.399 0.690
2 Rain_11 0.0000154 0.00000683 2.25 0.0251
3 Med_temp 0.0000129 0.00000296 4.36 0.0000162
4 Med_temp_rng 0.00000545 0.00000288 1.89 0.0589
Add interaction terms
models[["model2_inv_URTI_int"]] <-
lm((1 / URTI) ~ (Med_temp + Med_temp_rng) * Rain_11, data = data_time_URTI)
gvlma::gvlma(models$model2_inv_URTI_int)
Call:
lm(formula = (1/URTI) ~ (Med_temp + Med_temp_rng) * Rain_11,
data = data_time_URTI)
Coefficients:
(Intercept) Med_temp Med_temp_rng
-1.161e-04 1.590e-05 5.151e-06
Rain_11 Med_temp:Rain_11 Med_temp_rng:Rain_11
8.132e-04 -3.302e-05 1.568e-05
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = models$model2_inv_URTI_int)
Value p-value Decision
Global Stat 4.8987 0.29786 Assumptions acceptable.
Skewness 0.1427 0.70559 Assumptions acceptable.
Kurtosis 0.3799 0.53765 Assumptions acceptable.
Link Function 0.7336 0.39173 Assumptions acceptable.
Heteroscedasticity 3.6424 0.05632 Assumptions acceptable.
broom::tidy(models$model2_inv_URTI_int)
# A tibble: 6 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.000116 0.0000917 -1.27 0.206
2 Med_temp 0.0000159 0.00000315 5.05 0.000000645
3 Med_temp_rng 0.00000515 0.00000318 1.62 0.106
4 Rain_11 0.000813 0.000254 3.21 0.00144
5 Med_temp:Rain_11 -0.0000330 0.0000101 -3.27 0.00116
6 Med_temp_rng:Rain_11 0.0000157 0.00000837 1.87 0.0617
anova(models$model2_inv_URTI,models$model2_inv_URTI_int)
Analysis of Variance Table
Model 1: (1/URTI) ~ Rain_11 + Med_temp + Med_temp_rng
Model 2: (1/URTI) ~ (Med_temp + Med_temp_rng) * Rain_11
Res.Df RSS Df Sum of Sq F Pr(>F)
1 425 9.6001e-07
2 423 9.3611e-07 2 2.3896e-08 5.3988 0.004839 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
broom::glance(models$model2_inv_URTI)
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.0534 0.0467 4.75e-5 7.99 3.43e-5 3 3664. -7317. -7297.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom::glance(models$model2_inv_URTI_int)
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.0769 0.0660 4.70e-5 7.05 2.40e-6 5 3669. -7324. -7296.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Median temperature is significant at an alpha level of 5%. There is also an interaction effect (Med_temp × Rain_11), and the ANOVA test shows that adding the interaction terms improves the model fit (p = 0.0048). The interaction model also has a higher R² value.
r.squared(models$model2_inv_URTI) -> 0.053
r.squared(models$model2_inv_URTI_int) -> 0.077
adj.r.squared(models$model2_inv_URTI) -> 0.0467
adj.r.squared(models$model2_inv_URTI_int) -> 0.0660
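The comparison above follows a general nested-model pattern: fit the main-effects model, fit it again with the interaction terms added, then use `anova()` for the F-test and `summary()` (or `broom::glance()`) for the R² values. A self-contained sketch with simulated data in which a true interaction exists:

```r
set.seed(7)
n  <- 300
x1 <- rnorm(n)            # continuous predictor (think: temperature)
g  <- rbinom(n, 1, 0.5)   # binary moderator (think: Rain_11)
y  <- 1 + 0.5 * x1 + 0.8 * g + 1.2 * x1 * g + rnorm(n)  # true interaction

m_main <- lm(y ~ x1 + g)
m_int  <- lm(y ~ x1 * g)  # expands to x1 + g + x1:g

anova(m_main, m_int)       # F-test: do the interaction terms help?
summary(m_main)$r.squared  # R^2 without the interaction
summary(m_int)$r.squared   # higher R^2 with the interaction
```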
jtools::export_summs(
models$model2_inv_URTI,models$model2_inv_URTI_int,
error_format = "(p = {p.value})",
model.names = c("Main Effects Only", "Interaction Effects"),
digits = 3
)
| Main Effects Only | Interaction Effects | |
|---|---|---|
| (Intercept) | -0.000 | -0.000 |
| (p = 0.690) | (p = 0.206) | |
| Rain_11 | 0.000 * | 0.001 ** |
| (p = 0.025) | (p = 0.001) | |
| Med_temp | 0.000 *** | 0.000 *** |
| (p = 0.000) | (p = 0.000) | |
| Med_temp_rng | 0.000 | 0.000 |
| (p = 0.059) | (p = 0.106) | |
| Med_temp:Rain_11 | -0.000 ** | |
| (p = 0.001) | ||
| Med_temp_rng:Rain_11 | 0.000 | |
| (p = 0.062) | ||
| N | 429 | 429 |
| R2 | 0.053 | 0.077 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | ||
Probing Interaction Effects
# Run Spotlight analysis, using Johnson-Neyman Techniques
# Median Temperature
# interactions::sim_slopes(
# models$model2_inv_URTI_int,
# pred = Rain_11,
# modx = Med_temp,
# johnson_neyman = T
# ) # [27.28, 28.61]
# Run interact_plot(), adding benchmarks for the regions of significance
# Use plot B to see the actual URTI numbers instead of the inverse of y
# interactions::interact_plot(
# models$model2_inv_URTI_int,
# data = data_time_URTI,
# pred = "Med_temp",
# modx = "Rain_11",
# modx.labels = c("less than 11mm of rainfall", "rainfall >= 11mm"),
# interval = T,
# int.width = .95,
# colors = c("deeppink","darkgreen"),
# vary.lty = T,
# line.thickness = 1,
# legend.main = ""
# ) +
# geom_vline(xintercept = 27.28, col = "red", linetype = 3, size = 1) +
# geom_vline(xintercept = 28.61, col = "red", linetype = 3, size = 1) +
# labs(
# title = paste0(
# "Effect of Rainfall and Temperature on URTI Cases in Singapore"
# ),
# subtitle = paste0(
# "For lower weekly rainfall, the number of URTI cases ",
# "decreases with rise in temperature"
# ),
# x = "Weekly Median Temperature",
# y = "1/Number of URTI Cases",
# caption = "Source: MOH, NEA websites"
# ) +
# annotate(
# "text",
# x = 28,
# y = 0.0003,
# label = paste(
# "The shaded areas denote 95% confidence intervals.",
# "The vertical lines mark the boundary between",
# "regions of significance and non-significance",
# "based on alpha at 5%",
# sep = "\n"
# )
# ) +
# theme(legend.position = "top")
##############################################
# This model, model2_URTI_int, is fitted on the untransformed response for plotting purposes only.
models[["model2_URTI_int"]] <-
lm(URTI ~ (Med_temp + Med_temp_rng) * Rain_11, data = data_time_URTI)
interactions::sim_slopes(
models$model2_URTI_int,
pred = Rain_11,
modx = Med_temp,
johnson_neyman = T
) # [27.24, 28.48]
JOHNSON-NEYMAN INTERVAL
When Med_temp is OUTSIDE the interval [27.24, 28.48], the slope of Rain_11
is p < .05.
Note: The range of observed values of Med_temp is [25.06, 29.96]
SIMPLE SLOPES ANALYSIS
Slope of Rain_11 when Med_temp = 27.09 (- 1 SD):
Est. S.E. t val. p
--------- ------- -------- ------
-149.99 56.54 -2.65 0.01
Slope of Rain_11 when Med_temp = 27.94 (Mean):
Est. S.E. t val. p
------- ------- -------- ------
79.97 79.73 1.00 0.32
Slope of Rain_11 when Med_temp = 28.78 (+ 1 SD):
Est. S.E. t val. p
-------- -------- -------- ------
309.93 137.96 2.25 0.03
interactions::interact_plot(
models$model2_URTI_int,
pred = "Med_temp",
modx = "Rain_11",
modx.labels = c("less than 11mm of rainfall", "rainfall >= 11mm"),
interval = T,
int.width = .95,
colors = c("deeppink", "darkgreen"),
vary.lty = T,
line.thickness = 1,
legend.main = ""
) +
geom_vline(xintercept = 27.24, col = "red", linetype = 3, size = 1) +
geom_vline(xintercept = 28.48, col = "red", linetype = 3, size = 1) +
labs(
title =
"Effect of Rainfall and Temperature on URTI Cases in Singapore",
subtitle = paste0(
"For lower weekly rainfall, the number of URTI cases ",
"decreases with rise in temperature"
),
x = "Weekly Median Temperature",
y = "Number of URTI Cases",
caption = "Source: MOH, NEA websites"
) +
annotate(
"text",
x = 28,
y = 2000,
label = paste(
"The shaded areas denote 95% confidence intervals.",
"The vertical lines mark the boundary between",
"regions of significance and non-significance",
"based on alpha at 5%",
sep = "\n"
)
) +
theme(legend.position = "top")
In the modeling analyses, the models for all 4 clinical conditions except URTI failed the assumption checks. Despite the failed checks, we continued to test several models for DF because of our interest in the disease. The results below focus on URTI and DF. We observed the following:
It is expected that URTI cases increase when the air temperature is low, as people are more susceptible to the common cold in cold weather. We observed that the number of URTI cases increases when the air temperature is low with low rainfall, but decreases during low temperature with high rainfall. One possible reason for this phenomenon is that people may wear more layers when the rain is heavier, to protect themselves from the rain and cold.
However, it is interesting to note that the number of cases of both DF and URTI increases during high air temperature with high rainfall (>= 11mm). The increase in DF in warmer temperatures is likely attributable to the optimal temperature for mosquito development; research has shown that mosquitoes replicate faster at higher temperatures during the incubation period (Liu et al., 2017). The increase in URTI under high temperature is interesting and worth exploring further, especially in tropical countries, as understanding the aetiology would help reduce cases. One possible explanation is that indoor transmission in air-conditioned environments accounts for most of the transmission (Tang, Wong and Hon, 2010).
Dengue fever was observed to decrease under high rainfall and low temperature. We understand that precipitation is generally associated with an increased incidence of dengue fever. However, our results seem to support a recent study claiming that heavy rainfall could disrupt breeding habitats and the mosquitoes’ reproductive cycle by washing them away (Benedum, Seidahmed, Eltahir, & Markuzon, 2018).
These findings contribute to the evidence that weather plays a role in DF and URTI, and that there is no seasonal trend in URTI in Singapore. Therefore, it is important to maintain good personal hygiene at all times.
Given Singapore’s homogeneous weather conditions, it is an interesting observation that DF cases cluster in the North East of Singapore. This could be due to several reasons:
This finding suggests that NEA should step up its efforts to reduce the number of breeding habitats in the North East of Singapore.
We have identified some potential pitfalls for our project.
Future research could explore building models to predict the spread of, and susceptibility to, dengue fever, and even the effectiveness of interventions, to help prevent an outbreak.
SY devised the project and the main conceptual ideas. SY, RL, and AS searched for the main sources of data. RL created the data pipeline from source to RAM. SY and RL performed the exploratory visualizations and their interpretations. SY and AS planned out the directions for analyses. AS worked out the statistical analyses and created the statistical models. SY provided the summative interpretations as well as links to previous research. RL handled the technical details and maintained the source code. SY, AS, and RL wrote and approved the final version of the RMarkdown file. Knitr generated the final report.